Martin Atzmueller Dominik Benz Andreas Hotho Gerd Stumme (Eds.) LWA 2010 Lernen, Wissen & Adaptivita¨t Workshop Proceedings Kassel, October 4 – 6, 2010 Kasseler Informatikschriften (KIS): 2010/5 Knowledge and Data Engineering Group, University of Kassel, Germany Table of Contents Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Workshop KDML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Lifted Conditioning for Pairwise Marginals and Beyond . . . . . . . . . . . . . . . . . . . . 13 Babak Ahmadi, Kristian Kersting, and Fabian Hadiji. The Smart-TSH-Finder: Crawling and Analyzing Tempo-Spatial Hotspots in Second Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Akram Al-Kouz, Ernesto William De Luca, Jan Clausen, and Sahin Albayrak. Stream-based Community Discovery via Relational Hypergraph Factorization on Evolving Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Christian Bockermann and Felix Jungermann. Listing closed sets of strongly accessible set systems with applications to data . . 33 Mario Boley, Tamas Horvath, Axel Poigne. and Stefan Wrobel. Mining Music Playlogs for Next Song Recommendations . . . . . . . . . . . . . . . . . . . 35 Andre Busche, Artus Krohn-Grimberghe, and Lars Schmidt-Thieme. Graded Multilabel Classification: The Ordinal Case . . . . . . . . . . . . . . . . . . . . . . . 39 Weiwei Cheng, Krzysztof Dembczynski, and Eyke Hu¨llermeier. Bootstrapping Noun Groups Using Closed-Class Elements Only . . . . . . . . . . . . . 47 Kathrin Eichler and Gu¨nter Neumann. Towards Adjusting Mobile Devices to User’s Behaviour . . . . . . . . . . . . . . . . . . . . 51 Peter Fricke, Felix Jungermann, Katharina Morik, Nico Piatkowski, Olaf Spinczyk, and Marco Stolpe. Workflow Analysis using Graph Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Natalja Friesen and Stefan Rueping. Visually summarizing the Evolution of Documents under a Social Tag . . . . . . . . 67 Andre´ Gohr, Myra Spiliopoulou, and Alexander Hinneburg. One Clustering Process Fits All - A Visually Guided Ensemble Approach . . . . . . 69 Martin Hahmann, Dirk Habich, Maik Thiele, and Wolfgang Lehner. Efficient frequent connected subgraph mining in graphs of bounded tree-width . . 75 Tamas Horvath and Jan Ramon. On the Need of Graph Support for Developer Identification in Software Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Aftab Iqbal and Marcel Karnstedt. Separate-and-conquer Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Frederik Janssen and Johannes Fu¨rnkranz. Kernelized Map Matching for noisy trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Ahmed Jawad and Kristian Kersting. Convex NMF on Non-Convex Massiv Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Kristian Kersting, Mirwaes Wahabzada, Christian Thurau, and Christian Bauckhage. Quantitatives Frequent-Pattern Mining u¨ber Datenstro¨men . . . . . . . . . . . . . . . . . . 105 Daniel Klan and Thomas Rohe. A Novel Multidimensional Framework for Evaluating Recommender Systems . . 
113 Artus Krohn-Grimberghe, Alexandros Nanopoulos, and Lars Schmidt-Thieme. An Evaluation of Multilabel Classification for the Automatic Annotation of Texts 121 Eneldo Loza Mencı´a. Pattern Mining in Sparse Temporal Domains, an Interpolation Approach . . . . . . . 125 Christian Po¨litz. SVM Classifier Estimation from Group Probabilities . . . . . . . . . . . . . . . . . . . . . . 129 Stefan Rueping. Fast-Ensembles of Minimum Redundancy Feature Selection . . . . . . . . . . . . . . . . 137 Benjamin Schowe and Katharina Morik. Potenzial des Data Mining fu¨r Ressourcenoptimierung mobiler Gera¨te im Krankenhaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Rene Schult and Bastian Kurbjuhn. Probability Estimation and Aggregation for Rule Learning . . . . . . . . . . . . . . . . . . 143 Jan-Nikolas Sulzmann and Johannes Fu¨rnkranz. Conditional Random Fields For Local Adaptive Reference Extraction . . . . . . . . . 151 Martin Toepfer, Peter Kluegl, Andreas Hotho, and Frank Puppe. Towards Understanding the Changing Web: Mining the Dynamics of Linked-Data Sources and Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 Ju¨rgen Umbrich, Marcel Karnstedt, and Sebastian Land. Counting-based Output Prediction for Orphan Screening . . . . . . . . . . . . . . . . . . . 163 Katrin Ullrich, Christoph Stahr, and Thomas Ga¨rtner. Workshop IR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 Language Models, Smoothing, and IDF Weighting . . . . . . . . . . . . . . . . . . . . . . . . 169 Najeeb Abdulmutalib and Norbert Fuhr An Attribute-based Model for Semantic Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 175 Hany Azzam and Thomas Roelleke. Binary Histograms for Resource Selection in Peer-to-Peer Media Retrieval . . . . . 183 Daniel Blank and Andreas Henrich. Benutzerorientiertes Dokumenten-Clustering durch die Verwendung einer Anfragemenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Marc Lechtenfeld Semiautomatische Konstruktion von Trainingsdaten fu¨r die Suche in historischen Dokumenten . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Andrea Ernst-Gerlach and Norbert Fuhr An Evaluation of Geographic and Temporal Search . . . . . . . . . . . . . . . . . . . . . . . 199 Fredric Gey, Noriko Kando, and Ray Larson Image Retrieval on Mobile Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Adrian Hub and Andreas Henrich Named Entity Disambiguation for German News Articles . . . . . . . . . . . . . . . . . . 209 Andreas Lommatzsch, Danuta Ploch, Ernesto William De Luca, and Sahin Albayrak. Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Philipp Schaer, Philipp Mayr and Peter Mutschke Evaluation of five web search engines in Arabic language . . . . . . . . . . . . . . . . . . . 221 Wissam Tawileh, Thomas Mandl, and Joachim Griesbaum Ein Polyrepra¨sentatives Anfrageverfahren fu¨r das Multimedia Retrieval . . . . . . . 229 David Zellhoefer and Ingo Schmitt. Workshop WM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Knowledge System Prototyping for Usability Engineering . . . . . . . . . . . 
. . . . . . . 239 Martina Freiberg, Johannes Mitlmeier, Joachim Baumeister, and Frank Puppe Modeling of Diagnostic Guideline Knowledge in Semantic Wikis . . . . . . . . . . . . 247 Reinhard Hatko, Jochen Reutelshoefer, Joachim Baumeister, and Frank Puppe Linked Data Games: Simulating Human Association with Linked Data . . . . . . . . 255 Jo¨rn Hees, Thomas Roth-Berghofer, and Andreas Dengel Reuse of Pharmaceutical Experience on Patient-individual Formulations . . . . . . . 261 Mirjam Minor and Michael Raber Taking OWL to Athens: Semantic Web Technology takes Ancient Greek History to Students (Resubmission) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 Jochen Reutelshoefer, Florian Lemmerich, Joachim Baumeister, Jorit Wintjes, and Lorenz Haas Integration of Linked Open Data in Case-Based Reasoning Systems . . . . . . . . . . 269 Christian Severin Sauer, Kerstin Bach, and Klaus-Dieter Althoff Memex360 - Perso¨nliches Wissensmanagement mit Theseus ORDO . . . . . . . . . . 275 Ralph Traphoener and Bjo¨rn Decker Workshop ABIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 What is wrong with the IMS Learning Design specification? Constraints And Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 Daniel Burgos. Student Model Adjustment Through Random-Restart Hill Climbing . . . . . . . . . . 289 Ahmad Salim Doost and Erica Melis. On the Role of Social Tags in Filtering Interesting Resources from Folksonomies 295 Daniela Godoy. User Models meet Digital Object Memories in the Internet of Things . . . . . . . . . 303 Dominikus Heckmann. How Predictable Are You? A Comparison of Prediction Algorithms for Web Page Revisitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 Ricardo Kawase, George Papadakis, and Eelco Herder. User and Document Group Approach of Clustering in Tagging Systems . . . . . . . 315 Rong Pan, Guandong Xu, and Peter Dolog. Modeling, obtaining and storing data from social media tools with Artefact-Actor-Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 Wolfgang Reinhardt, Tobias Varlemann, Matthias Moi, and Adrian Wilke. Meta-rules: Improving Adaptation in Recommendation Systems . . . . . . . . . . . . . 331 Vicente Romero and Daniel Burgos. Social IPTV: a Survey on Chances and User-Acceptance . . . . . . . . . . . . . . . . . . . 337 Daniel Schreiber. Using a Semantic Multidimensional Approach to create a Contextual Recommender System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 Abdulbaki Uzun and Christian Ra¨ck. Preface The joint workshop event LWA 2010 (Lernen, Wissen, Adaptivita¨t) takes place from October 4th to October 6th 2010 in Kassel, Germany. Like in the years before the LWA hosts a broad scope of workshops of the special interest groups for – Adaptivity and User Modeling (FG-ABIS) – Information Retrieval (FG-IR) – Knowledge Discovery and Machine Learning (FG-KDML) – Knowledge and Experience Management (FG-WM) In addition to the separate workshops of each special interest group we invited two talks covering current research questions in computer science: – Ju¨rgen Geck: Turning the Web Inside Out. – Wolfgang Nejdl: Web of People – Improving Search on the Web. 
We are grateful for the support of all members of the Knowledge and Data Engineering group at the University of Kassel, especially Monika Vopicka for handling many organizational matters and almost all issues with bureaucracy. Furthermore, we thank all participants of the workshops for their contributions. Additionally, we want to thank all reviewers for their careful help in selecting and improving the submissions.

Kassel, September 2010 Martin Atzmueller, Dominik Benz, Andreas Hotho, Gerd Stumme 9

Workshop on Knowledge Discovery and Machine Learning KDML 2010
Martin Atzmueller, Knowledge and Data Engineering Group, University of Kassel, Germany, atzmueller@cs.uni-kassel.de
Dominik Benz, Knowledge and Data Engineering Group, University of Kassel, Germany, benz@cs.uni-kassel.de

The KDML Workshop Series The workshop Knowledge Discovery, Data Mining, and Machine Learning (KDML) 2010 is organized annually by the Special Interest Group on Knowledge Discovery, Data Mining, and Machine Learning (FG-KDML, formerly FGML) of the German Society on Computer Science (GI). The main goal of the workshop is to enable and stimulate the exchange of innovative ideas and to provide a forum for data mining and machine learning oriented researchers to discuss recent topics in these areas. Submissions from current research in these and adjacent areas were welcome. Moreover, contributions that describe work in progress or approaches that have not yet been investigated in depth were of special interest. As every year, the workshop was a forum for testing the viability of new ideas, by junior as well as senior researchers. Additionally, several resubmissions of previously published work proved to be interesting complements for initiating productive discussions. The workshop is part of the workshop week Learning - Knowledge Discovery - Adaptivity (LWA) 2010, which also features several other workshops. This provides the opportunity to meet researchers from the related special interest groups on Adaptivity and Interaction, on Information Retrieval, and on Knowledge Management, and fosters (inter-workshop) scientific discussions and the important exchange of ideas.

KDML 2010 The proceedings contain the papers presented at KDML 2010, held on October 4th–6th, 2010 in Kassel, Germany. Twelve regular research papers, six short papers, and twelve resubmissions of previously published work, in addition to three posters, were accepted.
The topics of interest of the KDML workshop series are: • Mining and Analysis of Networks and Graphs • Rule Learning • Text Mining, Web Mining • Distributed Data Mining • Ubiquitous Knowledge Discovery • Unsupervised and Semi-Supervised Learning • Visual Analytics • Knowledge Discovery in inductive databases • Bioinformatics applications • Data Stream Mining • Temporal Knowledge Discovery Workshop Chairs • Martin Atzmueller, University of Kassel • Dominik Benz, University of Kassel Program Committee • Martin Atzmueller, University of Kassel • Dominik Benz, University of Kassel • Stephan Doerfel, University of Kassel • Folke Eisterlehner, University of Kassel • Zeno Gantner, University of Hildesheim • Robert Ja¨schke, University of Kassel • Peter Kluegl, University of Wu¨rzburg • Beate Krause, University of Wu¨rzburg • Florian Lemmerich, University of Wu¨rzburg • Bjo¨rn-Elmar Macek, University of Kassel • Leandro Marinho, Federal University of Maranhao • Christoph Scholz, University of Kassel We thank the authors for their submissions, and especially thank the Program Committee for their good work. August 2010 Martin Atzmueller Dominik Benz Acknowledgements This volume has been produced in part using the EasyChair system1. We would like to express our gratitude to its author Andrei Voronkov. 1http://www.easychair.org 11 Lifted Conditioning for Pairwise Marginals and Beyond ∗ Babak Ahmadi and Kristian Kersting and Fabian Hadiji Knowledge Dicovery Department, Fraunhofer IAIS 53754 Sankt Augustin, Germany firstname.lastname@iais.fraunhofer.de Abstract Lifted belief propagation (LBP) can be extremely fast at computing approximate marginal prob- ability distributions over single variables and neighboring ones in the underlying graphical model. It does, however, not prescribe a way to compute joint distributions over pairs, triples or k-tuples of distant random variables. In this pa- per, we present an algorithm, called conditioned LBP, for approximating these distributions. Es- sentially, we select variables one at a time for conditioning, running lifted belief propagation after each selection. This naive solution, how- ever, recomputes the lifted network in each step from scratch, therefore often canceling the ben- efits of lifted inference. We show how to avoid this by efficiently computing the lifted network for each conditioning directly from the one al- ready known for the single node marginals. This contribution advances the theoretical understand- ing of lifted inference but also allows one to ef- ficiently solve many important AI tasks such as finding the MAP assignment, sequential forward sampling, parameter estimation, active learning, sensitivity analysis, to name only few. Our ex- perimental results validate that significant effi- ciency gains are possible and illustrate the po- tential for second-order parameter estimation of Markov logic networks. 1 Introduction There has been much recent interest in methods for per- forming lifted probabilistic inference, handling whole sets of indistinguishable objects together. Poole [2003], Braz et al. [2005], and Milch et al. [2008] have developed lifted versions of the variable elimination algorithm. Sen et al. [2009] presented a lifted variable elimination approach based on bisimulation. Most of these lifted inference ap- proaches are extremely complex, so far do not easily scale to realistic domains and hence have only been applied to rather small artificial problems. 
A remarkable excep- tion are lifted versions of belief propagation [Singla and Domingos, 2008; Kersting et al., 2009]. Similar to Sen et al.’s approach, They grouped together random variables that have identical computation trees but now run a mod- ified belief propagation (BP) on the resulting lifted, i.e., clustered network. Being instances of BP, they can be ex- tremely fast at computing approximate marginal probabil- ∗This paper also appeared in PGM 2010 ity distributions over single variable nodes and neighboring ones in the underlying graphical model. Above all, they naturally scale to realistic domain sizes. Despite their suc- cess, however, lifted BP approaches do not provide a pre- scription to compute joint probabilities over pairs of non- neighboring variables in the graph. When the underlying graphical model is a tree, there is a single chain connect- ing any two nodes, and dynamic programming techniques might be developed for efficiently integrating out the inter- nal variables. When cycles exist, however, it is not clear what approximate procedure is appropriate. The situation is even more frustrating when computing marginals over triples or k-tuples of distant nodes. As for the non-lifted case, sophisticated exact lifted inference algorithms are only tractable on rather small models and do not scale to re- alistic domain sizes. It is precisely this problem that we are addressing in this paper, that is we are interested in approx- imate lifted inference algorithms based on the conditioning idea that scale to realistic domain sizes. Specifically, we present conditioned LBP (CLBP), a scalable lifted infer- ence algorithm for approximate inference based on condi- tioning. Essentially, we select variables one at a time for conditioning and run lifted belief propagation after each selection. This naive solution, however, recomputes the lifted network in each step from scratch, therefore often canceling the benefits of lifted inference. We show how to avoid this by efficiently computing the lifted network for each conditioning directly from the one already known for the single node marginals. There has been some prior work for related problems. Delcher et al. [1996] propose a data structure that allows efficient queries when new ev- idence is incorporated in singly connected Bayesian net- works and Acar et al. [2008] present an algorithm to adapt the model to structural changes using an extension of Rake- and-Compress Trees. The only lifted inference approach we are aware of is the work by Nath and Domingos [2010] that was independently developed in parallel. The au- thors essentially simulate their lifting procedure for a set of changed variables, obtaining the adapted lifted network. We also consider the problem of determining the best variable to condition on in each iteration to stay maxi- mally lifted over all iterations and propose a simple heuris- tic. Our experimental evaluation including experiments on second-order parameter estimation for Markov logic net- works [Richardson and Domingos, 2006] shows that sig- nificant efficiency gains are obtainable compared to naively running (lifted) BP in each iteration. CLBP may also have future applications in more advanced relational learning tasks such as active learning. We proceed as follows. We start off by briefly reviewing LBP. 
Then, we introduce CLBP, prove its soundness, and 13 X 1 X2 X3 f 1 f 2 X 1 X 2 f 1 X 2 X 3 f 2 … X n+1 f n X n X n+1 f n True True 1.2 True False 1.4 False True 2.0 False False 0.4 True True 1.2 True False 1.4 False True 2.0 False False 0.4 … True True 1.2 True False 1.4 False True 2.0 False False 0.4 Figure 1: An example for a factor graph — a chain graph model with n+ 1 nodes — with associated potentials. Cir- cles denote variables, squares denote factors. touch upon the problem of determining the best variable to condition on at each level of recursion. Before concluding, we present the results of our experimental evaluation. 2 Lifted Belief Propagation Let X = (X1, X2, . . . , Xn) be a set of n discrete-valued random variables each having d states, and let xi represent the possible realizations of random variable Xi. Graphical models compactly represent a joint distribution over X as a product of factors [Pearl, 1991], i.e., P (X = x) = Z−1 ∏ k fk(xk) . Each factor fk is a non-negative function of a subset of the variables xk, and Z is a normalization constant. If P (X = x) > 0 for all joint configurations x, the distribu- tion can be equivalently represented as a log-linear model: P (X = x) = Z−1 exp [∑ i wi · gi(x) ] , where the features gi(x) are arbitrary functions of (a subset of) the configuration x. Each graphical model can be represented as a factor graph. A factor graph, cf. Fig 1, is a bipartite graph that expresses the factorization structure of the joint distribu- tion. It has a variable node (denoted as a circle) for each variable Xi, a factor node (denoted as a square) for each fk, with an edge connecting variable node i to factor node k if and only if Xi is an argument of fk. We assume one factor fi(x) = exp [wi · gi(x)] per feature gi(x). An important (#P-complete) inference task is to compute the conditional probability of variables given the values of some others, the evidence, by summing out the remaining variables. The belief propagation (BP) algorithm is an effi- cient way to solve this problem that is exact when the fac- tor graph is a tree, but only approximate when the factor graph has cycles. Although this loopy BP has no guaran- tees of convergence or of giving the correct result, in prac- tice it often does, and can be much more efficient than other methods. BP can be elegantly described in terms of send- ing messages within a factor graph. The message from a variable X to a factor f is µX→f (x) = ∏ h∈nb(X)\{f} µh→X(x) where nb(X) is the set of factors X appears in. The mes- sage from a factor to a variable is µf→X(x) = ∑ ¬{X} f(x) ∏ Y ∈nb(f)\{X} µY→f (y)  Xn Xn+1X1 Xn/2-1 Xn Xn+1X1 X2 Xn/2-1 X2 … c cc Xn Xn+1X1 Xn/2-1X2 Figure 2: Supernodes, indicated by the shades of the orig- inal nodes, produced by repeatedly clamping nodes, indi- cated by ”c”,on a chain graph model with n + 1 nodes. Factors have been omitted here. The conditioning order is pi = {2, 1, 3, 4, . . . , n−2, n−1, n+1, n}. After clamping X2 all subsequent LBP runs work on the fully grounded network. where nb(f) are the arguments of f , and the sum is over all of these except X , denoted as ¬{X}. The messages are usually initialized to 1, and the unnormalized belief of each variable Xi can be computed from the equation bi(xi) = ∏ f∈nb(Xi) µf→Xi(xi) . Evidence is incorporated by setting f(x) = 0 for states x that are incompatible with it. Different schedules may be used for message-passing. 
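To make the message-passing scheme concrete, the following is a minimal, self-contained sketch of loopy BP on a discrete factor graph. It is an illustration with assumed data structures and names, not the authors' libDAI-based implementation (cf. Section 4); evidence handling and damping are omitted.

```python
# Minimal loopy BP on a discrete factor graph (illustrative sketch only; the
# data structures and names are assumptions, not the authors' libDAI-based code).
# A factor is a pair (scope, table): scope is a tuple of variable names and
# table maps each joint state of the scope to a non-negative potential.
import itertools
from collections import defaultdict

def loopy_bp(variables, factors, n_states=2, iters=50):
    nb = defaultdict(list)              # variable -> factors it appears in (nb(X) in the text)
    for k, (scope, _) in enumerate(factors):
        for X in scope:
            nb[X].append(k)

    # All messages are initialised to 1, as in the text.
    msg_vf = {(X, k): [1.0] * n_states for X in variables for k in nb[X]}
    msg_fv = {(k, X): [1.0] * n_states
              for k, (scope, _) in enumerate(factors) for X in scope}

    for _ in range(iters):
        # Variable -> factor: product of all incoming factor messages except the target's.
        for (X, k) in list(msg_vf):
            m = [1.0] * n_states
            for h in nb[X]:
                if h != k:
                    m = [a * b for a, b in zip(m, msg_fv[(h, X)])]
            z = sum(m) or 1.0
            msg_vf[(X, k)] = [v / z for v in m]   # normalisation does not change the beliefs

        # Factor -> variable: sum the factor times the other incoming messages over
        # all states of the remaining arguments (the sum over "not {X}" in the text).
        for k, (scope, table) in enumerate(factors):
            for X in scope:
                i = scope.index(X)
                m = [0.0] * n_states
                for states in itertools.product(range(n_states), repeat=len(scope)):
                    val = table[states]
                    for Y, y in zip(scope, states):
                        if Y != X:
                            val *= msg_vf[(Y, k)][y]
                    m[states[i]] += val
                z = sum(m) or 1.0
                msg_fv[(k, X)] = [v / z for v in m]

    # Belief of each variable: product of its incoming factor messages, normalised here.
    beliefs = {}
    for X in variables:
        b = [1.0] * n_states
        for k in nb[X]:
            b = [a * c for a, c in zip(b, msg_fv[(k, X)])]
        z = sum(b) or 1.0
        beliefs[X] = [v / z for v in b]
    return beliefs

# The chain of Fig. 1 with n = 2: states 0/1 stand for True/False, potentials as in its table.
phi = {(0, 0): 1.2, (0, 1): 1.4, (1, 0): 2.0, (1, 1): 0.4}
print(loopy_bp(["X1", "X2", "X3"], [(("X1", "X2"), phi), (("X2", "X3"), phi)]))
```

On a tree such as this chain the resulting beliefs are exact; lifted BP, described next, runs the same updates on the compressed graph, with exponents given by the edge counts c(f, X).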
Although already quite efficient, many graphical mod- els produce factor graphs with a lot of symmetries not re- flected in the graphical structure. Consider the factor graph in Fig. 1. Although the factors involved are different on the surface, they actually share quite a lot of information, since the associated potentials are identical. Lifted BP (LBP) can make use of this fact. It essentially performs two steps: Given a factor graph G, it first computes a compressed fac- tor graph G and then runs a modified BP on G. We use fraktur letters such as G, X, and f to denote the lifted, i.e., compressed graphs, nodes, and factors. Step 1 of LBP — Lifting by Color-Passing (CP): LetG be a given factor graph with variable and factor nodes. Ini- tially, all variable nodes fall into d+ 1 groups (one or more of these may be empty) — known states s1, . . . , sd, and unknown — represented by colors. All factor nodes with the same associated potentials also fall into one group rep- resented by a color. Now, each variable node sends a mes- sage to its neighboring factor nodes saying “I am of color C”. A factor node sorts the incoming colors into a vector according to the order the variables appear in its arguments. The last entry of the vector is the factor node’s own color. This color signature is sent back to the neighboring vari- ables nodes, essentially saying “You have communicated with these kinds of nodes”. The variable nodes stack the incoming signatures together and, hence, form unique sig- natures of their one-step message history. Variable nodes with the same stacked signatures are grouped together, and a new color is assigned to each group. The factors are grouped in a similar fashion based on the incoming color signatures of neighboring nodes. This CP process is iter- ated until no new colors are created anymore. As the effect of the evidence propagates through the factor graph, more groups are created. The final lifted graph G is constructed by grouping nodes with the same color (signatures) into su- pernodes and all factors with the same color signatures into 14 superfactors. Supernodes (superfactors) are sets of nodes (factors) that send and receive the same messages at each step of carrying out BP and form a partition of the nodes in G. On this lifted network, LBP runs an efficient modi- fied BP (MBP). We refer to [Singla and Domingos, 2008; Kersting et al., 2009] for details. (Step 2) Modified BP (MBP) on the Lifted Graph: The basic idea is to simulate BP carried out on G on G. 1 An edge from a superfactor f to a supernode Xi in G es- sentially represents multiple edges in G. Let c(f,Xi) be the number of identical messages that would be sent from the factors in the superfactor f to each node in the supern- ode Xi if BP was carried out on G. The message from a supervariable X to a superfactor f is µX→f(x) = µf→X(x)c(f,X)−1 · ∏ h∈nb(X)\{f} µh→X(x) c(h,X) where nb(X) now denotes the neighbor relation in the lifted graph G. The c(f,X) − 1 exponent reflects the fact that a subervariable’s message to a superfactor excludes the cor- responding factor’s message to the variable if BP was car- ried out on G. Finally, the unnormalized belief of Xi, i.e., of any node X in Xi can be computed from the equation bi(xi) = ∏ f∈nb(Xi) µf→Xi(xi) c(f,X) . Evidence is incorporated as in standard BP. by setting f(x) = 0 for states x that are incompatible with it. 3 Lifted Conditioning We are often faced with the problem of repeatedly answering slightly modified queries on the same net- work. Consider e.g. 
computing a joint distribution P (X1, X2, . . . , Xk) using LBP. A simple method is the following conditioning procedure that we call conditioned LBP (CLBP). Let pi define a conditioning order on the nodes, i.e., a permutation on the set {1, 2, . . . , k} and its i-th element be denoted as pi(i). The simplest one is pi(i) = i. Now, we select variables one at a time for condi- tioning, running LBP after each selection, and combine the resulting marginals. More precisely, 1. Run LBP to compute the prior distribution P (Xpi(1)). 2. Clamp Xpi(1) to a specific state xpi(1). Run LBP to compute the conditional distribution P (Xpi(2)|xpi(1)). 3. Do this for all states of Xpi(1) to obtain all condi- tional distributions P (Xpi(2)|Xpi(1)). The joint distri- bution is now P (Xpi(2), Xpi(1)) = P (Xpi(2)|Xpi(1)) · P (Xpi(1)). By iterating steps 2) and 3) and employing the chain rule we have P (X1, . . . , Xk) = P (Xpi(1), . . . , Xpi(k)) =∏k i=1 P (Xpi(i)|Xpi(i−1), . . . , Xpi(1)) . CLBP is simple and even exact for tree-structured models. Indeed, it is common to apply (L)BP to graphs with cycles as well. In this case the beliefs will in general not equal the true marginals, but often provide good approximations in practice. Moreover, Welling and Teh [2003] report that conditioning performs surprisingly well in terms of accuracy for estimating the co- variance2. In the lifted case, however, the naive solution of 1For the sake of simplicity, we present simplified equations that neglect the positions a supernode may appear in a superfactor. 2The symmetrized estimate of the covariance matrix is typi- cally not positive semi-definite and marginals computed from the joint distributions are often inconsistent with each other. repeatedly calling LBP may perform poorly in terms of run- ning time. We are repeatedly answering slightly modified queries on the same graph. Because LBP generally lacks the opportunity of adaptively changing the lifted graph and using the updated lifted graph for efficient inference, it is doomed to lift the original model in each iteration again from scratch. Each CP run scales O(n ·m) where n is the number of nodes and m is the length of the longest path without loop. Hence, CLBP essentially spendsO(k ·n ·m) time just on lifting. Moreover, in contrast to the proposi- tional case, the conditioning order has an effect on the sizes of the lifted networks produced and, hence, the running time of MBP. It may even cancel out the benefit of lifted in- ference. Reconsider our chain example3. Figure 2 sketches the lifted networks produced over time when using the con- ditioning order pi = {2, 1, 3, 4, . . . , n− 2, n− 1, n+ 1, n}. That is, we clamp X2, then all other nodes but Xn in as- cending order. As one can see, clampingX2 dooms all sub- sequent iterations to run MBP on the fully grounded net- work, canceling the benefits of lifted inference. In contrast, the order pi = {1, n+ 1, 2, n, . . . , n/2− 1} produces lifted and fully grounded networks alternatingly, the best we can achieve for chain models. We now address both issues. Shortest-Paths Lifting: Consider the situation depicted in Fig. 3. Given the network in (A) and the prior lifted net- work, i.e., the lifted network when no evidence has been set (B), we want to compute P (X|x3) as shown in (C). To do so, it is useful to describe BP in terms of its com- putation tree (CT), see e.g. [Ihler et al., 2005]. 
The CT is the unrolling of the (loopy) graph structure where each level i corresponds to the i-th iteration of message pass- ing. Similarly we can view CP, i.e., the lifting procedure as a colored computation tree (CCT). More precisely, one considers for every node X the computation tree rooted in X but now each node in the tree is colored according to the nodes’ initial colors, cf. Fig. 3(bottom). Each CCT encodes the root nodes’ local communication patterns that show all the colored paths along which node X communi- cates in the network. Consequently, CP groups nodes with respect to their CCTs: nodes having the same set of rooted paths of colors (node and factor names neglected) are clus- tered together. For instance, Fig. 3(A) shows the CCTs for X3 and X5. Because their set of paths are different, X3 and X5 are clustered into different supernodes as shown in Fig. 3(B). The prior lifted network can be encoded as the vector l = (0, 0, 1, 1, 0, 0) of node colors. Now, when we clamp a node, sayX3, to a value x3, we change the commu- nication pattern of every node having a path to X . Specif- ically, we change X3’s (and only X3’s) color in all CCTs X3 is involved. This is illustrated in Fig. 3(B). For the prior lifted network, the dark and light nodes in Fig. 3(B) exhibit the same communication pattern in the network. Conse- quently,X3 appears at the same positions in all correspond- ing CCTs. When we now incorporate evidence on nodeX3, we change its color in all CCTs as indicated by the ”c” in Figs. 3(B) and (C). This effects nodes X1 and X2 differ- ently than X4 respectively X5 and X6 for two reasons: (1) they have different communication patterns as they belong to different supernodes in the prior network; more impor- tantly, (2) they have different paths connecting them to X3 3When the graph is a chain or a tree there is a single chain con- necting any two nodes and LBP together with dynamic program- ming can be used to efficiently integrate out the internal variables. When cycles exist, however, it is unclear what approximate pro- cedure is appropriate. 15 X5 X4 X6 X2X1 X3 X1 X3 X2 X5 X6 X4 X3 X4 X1 X6 X5 X2 (B) Lifted model - no evidence Examples of colored computation trees (A) Originial factor graph X1 X3 X2 X5 X6 X4 (C) Lifted model - evidence X1 X3 X2 X5 X6 X4 X5 X4 X6 X2X1 X3 X5 X4 X6 X2X1 X3c X1 X3 X2 X6X5 X4 c X3 X4 c (D) Shortest Path Distances X1 X2 X3 X4 X6 X6 X1 0 2 1 2 3 3 X2 2 0 1 2 3 3 X3 1 1 0 1 2 2 X4 2 2 1 0 1 1 X5 3 3 2 1 0 1 X6 3 3 2 1 1 0 X5 X6 X2X1 Figure 3: (A): Original factor graph. (B): Prior lifted network, i.e., lifted factor graph with no evidence. (C): Lifted factor graph when X3 is set to some evidence. Factor graphs are shown (top) with corresponding colored computation trees (bottom). For the sake of simplicity, we assume identical factors (omitted here). Ovals denote variables/nodes. The shades in (B) and (C) encode the supernodes. (D): Shortest-path distances of the nodes. The i-th row will be denoted di. in their CCTs. The shortest path is the shortest sequence of factor colors connecting two nodes. Since we are not inter- ested in the paths but whether the paths are identical or not, these sets might as well be represented as colors. Note that in Fig. 3 we assume identical factors for simplicity. Thus in this case path colors reduce to distances. In the general case, however, we compare the paths, i.e. the sequence of factor colors. We only have to consider the vector d3 of shortest-paths distances to X3, cf. Fig. 
3(D), and refine the initial supern- odes correspondingly. Recall that the prior lifted network can be encoded as the vector l = (0, 0, 1, 1, 0, 0) of node colors. This is equivalent to (1) l ⊕ d3, the element-wise concatenation of two vectors, and (2) viewing each result- ing number as a new color. (0, 0, 1, 1, 0, 0)⊕ (1, 1, 0, 1, 2, 2) =(1) (01, 01, 10, 11, 02, 02) =(2) (0, 0, 1, 2, 3, 3), the lifted network for P (X|x3) as shown in Fig. 3(C). Thus, we can directly update the prior lifted network in linear time without taking the detour through running CP on the ground network. Now, let us compute the lifted net- work for P (X|x4, x3). Essentially, we proceed as before: compute l ⊕ (d3 ⊕ d4). However, the resulting network might be suboptimal. It assumes x3 6= x4 and, hence, X3 and X4 cannot be in the same supernode. For x4 = x3, they could be placed in the same supernode, if they are in the same supernode in the prior network. This can be checked by d3 d4, the element-wise sort of two vectors. In our case, this yields l ⊕ (d3 d4) = l ⊕ l = l: the prior lifted network. In general, we compute l⊕ (⊕s(⊕v ds,v)) where ds,v = ⊙ i∈s:xi=v di , s and v are the supernodes and the truth value respectively. For an arbitrary network, however, the shortest paths might be identical although the nodes have to be split, i.e. they differ in a longer path, or in other words, the shortest paths of other nodes to the ev- idence node are different. Consequently we iteratively ap- ply the shortest paths lifting. Let SNS denote the supern- odes given the set S as evidence. By applying the short- est path procedure we compute SN{X1} from SN∅. This step might cause initial supernodes to be split into newly formed supernodes. To incorporate these changes in the network structure the shortest paths lifting procedure has to be iteratively applied. Thus in the next step we com- pute SN{X1}∪ΓX1 from SN{X1}, where ΓX1 denotes the changed supernodes of the previous step. This procedure is iteratively applied until no new supernodes are created. This essentially sketches the proof of the following theo- rem. Theorem 1. If the shortest-path colors among all nodes and the prior lifted network are given, computing the lifted network for P (X|Xi, . . . , X1), i > 0, takes O(i · n · s), where n is the number of nodes, s is the number of supern- odes. Running MBP produces the same results as running BP on the original model. Proof. For a GraphG = (V,E), when we set new evidence for a node X ∈ V then for all nodes within the network the color of node X in the CCTs is changed. If two nodes Y1, Y2 ∈ V were initially clustered together (denoted as sn0(Y1) = sn0(Y2)), i.e. they belong to the same supern- ode, they have to be split if the CCTs differ. Now we have to consider two cases: If the difference in the CCTs is in the shortest path connecting X with Y1 and Y2, re- spectively, then shortest-path lifting directly provides the new clustering. If the coloring along the shortest paths is identical the nodes’ CCTs might change in a longer path. Since sn0(Y1) = sn0(Y2) there exists a mapping between the paths of the respective CCTs. In particular ∃Z1, Z2, s.t. sn0(Z1) = sn0(Z2) from a different supern- ode, i.e. sn0(Zi) 6= sn0(Yi), and Y1, . . . , Z1, . . . , X︸ ︷︷ ︸ ∆1 ∈ CCT (Y1), Y1, . . . , Z2, . . . , X︸ ︷︷ ︸ ∆2 ∈ CCT (Y2) and ∆1 ∈ CCT (Z1) 6= ∆2 ∈ CCT (Z2) are the respective shortest paths for Z1 and Z2. 
Thus, by iteratively applying shortest- path lifting as explained above, the evidence propagates through and we obtain the new clustering. 16 0 5 10 15 20 25 0 100 200 300 400 500 600 To ta l M es sa ge s (m illio ns ) # Variables BP LBP CLBP 0 1e-05 2e-05 3e-05 4e-05 5e-05 6e-05 7e-05 8e-05 9e-05 0.0001 0 50 100 150 200 250 St an da rd D ev ia tio n # Variables BP LBP CLBP Figure 4: Pairwise Probability Estimates: (Left) Comparison of the total number of messages sent for BP, Lifted BP and ”min-split” order CLBP for ”Friends-and-Smokers” MLNs (including clustering messages for LBP and CLBP). (Right) The Standard Deviation of the error compared to the exact solution computed using the Junction Tree Algorithm. On Finding a Conditioning Order: Clearly, CLBP will be most efficient for estimating the probability of a joint state when it produces the smallest lifted networks. This calls for the task of finding the most efficient4 conditioning order. Here, we provide a generically applicable strategy based on the nodes’ shortest-path colors to all other nodes. That is, in each conditioning iteration, we add that node having the smallest number of unique paths to all other nodes and, if possible, is a member of a supernode of one of the already clamped nodes. Intuitively, we select nodes that are expected to create the smallest number of splits of existing supernodes in each iteration. Therefore, we call it min-split. Although this increases the running time — each conditioning iteration now has an additionalO(n2) step — our experiments show that there are important cases such as computing pairwise joint marginals where the efficiency gains achievable due to a better lifting can compensate this overhead. 4 Experimental Evaluation Our intention here is to illustrate the performance of CLBP compared to naively running LBP and BP. We implemented CLBP and its variants in Python and using LIBDAI li- brary [Mooij, 2009] and evaluated the algorithms on a num- ber of Markov logic networks. In our first experiment, we compared CLBP to naively running LBP, i.e. lifting the network each time from scratch, and BP for computing pairwise probabilities. We generated the ”Friends-and-Smokers” Markov logic net- work [Singla and Domingos, 2008] with 2, 5, 10, 15, 20, and 25 people, resulting in networks ranging from 8 to 675 nodes. The shortest-path lifting clearly pays out in terms of the total messages sent (including CP and shortest-path messages) as shown in Fig. 4 (left). Moreover, the accu- racy estimates are surprisingly good and confirm Welling and Teh [2003]; Fig. 4 (right) shows the Standard Devia- tion of the difference compared to the exact solution com- puted using the Junction Tree (JT). The maximal error we got was below 10−4. Note, however, that running JT with 4This question is different from the more common question of finding highly accurate orders. The latter question is an ac- tive research area already for the ground case see e.g. [Eaton and Ghahramani, 2009], and is also related to the difficult question of convergent BP variants, see e.g. [Mooij et al., 2007]. more than 20 persons was impossible due to memory and time restrictions. In our second experiment we investigated CLBP for computing joint marginals. For the ”Friends-and-Smokers” MLN with 20 people we randomly chose 1, 2, . . ., 10 ”can- cer” and ”friends” nodes as query nodes. The joint state was randomly chosen. The results are averaged over 10 runs. Fig. 5 shows the cumulative number of messages (in- cluding CP messages). 
”Min-split” is indeed better. By chosing the order following our heuristic the cumulative number of supernodes and in turn messages is reduced compared to a random elimination order. Finally, we learnt paramters for the ”Friends-and- Smokers” MLN with 10 perople, maximizing the condi- tional marginal log-likelihood (CMLL). Therefore we sam- pled 5 data cases from the joint distribution. We com- pared conjugate gradient (CG) optimization using Polak- Ribiere with Newton conjugate gradient (NCG) optimiza- tion using the covariance matrix of MLN clauses com- puted using CLBP. The gradient was computed as de- scribed in [Richardson and Domingos, 2006] but normal- ized by the number of groundings of each clause. The re- sults summarized in Fig. 6 confirm that information about dependencies among clauses is indeed useful: the second order method exhibits faster convergence. 5 Conclusion We presented conditioned lifted BP, the first approach for computing arbitrary joint marginals using lifted BP. It relates conditioning to computing shortest-paths. Ex- ploiting this link in order to establish runtime bounds is an interesting avenue for future work. By combining lifted BP and variable conditioning, it can readily be applied to models of realistic domain size. As our results show significant efficiency gains are obtainable, sometimes order of magnitude, compared to naively running (lifted) BP in each iteration. An interesting avenue for future work is to apply CLBP within important AI tasks such as finding the MAP assignment, sequential forward sampling, and structure learning. Furthermore, our results suggest to develop lifted cutset conditioning algorithms, see e.g. [Bidyuk and Dechter, 2007], and to lift Eaton and Ghahrmani’s [2009] fast heuristic for selecting nodes to be clamped to improve CLBP’s accuracy. 17 0 50 100 150 200 250 300 1 2 3 4 5 6 7 8 9 10 # M es sa ge s (th ou sa nd s) # Query Variables BP random min-split Figure 5: Number messages sent for computing joint marginals of varying size for BP, ”random” and ”min- split” order CLBP. -320 -300 -280 -260 -240 -220 -200 -180 -160 0 5 10 15 20 25 30 35 40 CM LL Iteration CG NCG Figure 6: Learning curves for ”Friends-and-Smokers” MLN. Optimization using clause covariances shows faster convergence. Acknowledgements. This work was supported by the Fraunhofer ATTRACT fellowship STREAM and by the European Commission under contract number FP7- 248258-First-MM. References [Acar et al., 2008] U. A. Acar, A. T. Ihler, R. R. Mettu, and O¨. Su¨mer. Adaptive inference on general graphical models. In Proc. of the Conf. on Uncertainty in Artificial Intelligence (UAI-08), 2008. [Bidyuk and Dechter, 2007] B. Bidyuk and R. Dechter. Cutset sampling for bayesian networks. Journal of Arti- ficial Intelligence Research, 28, 2007. [de Salvo Braz et al., 2005] R. de Salvo Braz, E. Amir, and D. Roth. Lifted First Order Probabilistic Infer- ence. In Proc. of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), pages 1319–1325, 2005. [Delcher et al., 1996] A. L. Delcher, A. J. Grove, S. Kasif, and J. Pearl. Logarithmic-time updates and queries in probabilistic networks. JAIR, 4:37–59, 1996. [Eaton and Ghahramani, 2009] F. Eaton and Z. Ghahra- mani. Choosing a variable to clamp: Approximate in- ference using conditioned belief propagation. In Proc. of the 12th International Conference on Artificial Intel- ligence and Statistics (AIStats-09), 2009. [Ihler et al., 2005] A.T. Ihler, J.W. Fisher III, and A.S. Willsky. 
Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905–936, 2005. [Kersting et al., 2009] K. Kersting, B. Ahmadi, and S. Natarajan. Counting belief propagation. In J. Bilmes A. Ng, editor, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI–09), Mon- treal, Canada, June 18–21 2009. [Milch et al., 2008] B. Milch, L. Zettlemoyer, K. Kersting, M. Haimes, and L. Pack Kaelbling. Lifted Probabilis- tic Inference with Counting Formulas. In Proc. of the 23rd AAAI Conf. on Artificial Intelligence (AAAI-08), July 13-17 2008. [Mooij et al., 2007] J. Mooij, B. Wemmenhove, H. Kap- pen, and T. Rizzo. Loop corrected belief propagation. In Proc. of the 11th International Conference on Artifi- cial Intelligence and Statistics (AIStats-09), 2007. [Mooij, 2009] Joris M. Mooij. libDAI 0.2.3: A free/open source C++ library for Discrete Approximate Inference. http://www.libdai.org/, 2009. [Murphy et al., 1999] K.P. Murphy, Y. Weiss, and M.I. Jordan. Loopy Belief Propagation for Approximate In- ference: An Empirical Study. In Proc. of the Conf. on Uncertainty in Artificial Intelligence (UAI-99), pages 467–475, 1999. [Nath and Domingos, 2010] A. Nath and P. Domingos. Ef- ficient lifting for online probabilistic inference. In Pro- ceedings of the Twenty-Fourth AAAI Conference on Ar- tificial Intelligence (AAAI-10), 2010. [Park, 2002] J.D. Park. MAP complexity results and ap- proximation methods. In Proc. of the Conf. on Un- certainty in Artificial Intelligence (UAI-02), pages 388– 396, 2002. [Pearl, 1991] J. Pearl. Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2. edition, 1991. [Poole, 2003] D. Poole. First-Order Probabilistic Infer- ence. In Proc. of the 18th International Joint Confer- ence on Artificial Intelligence (IJCAI-05), pages 985– 991, 2003. [Richardson and Domingos, 2006] M. Richardson and P. Domingos. Markov Logic Networks. Machine Learning Journal, 62:107–136, 2006. [Sen et al., 2009] P. Sen, A. Deshpande, and L. Getoor. Bisimulation-based approximate lifted inference. In J. Bilmes A. Ng, editor, Proceedings of the 25th Confer- ence on Uncertainty in Artificial Intelligence (UAI–09), Montreal, Canada, June 18–21 2009. [Singla and Domingos, 2008] P. Singla and P. Domingos. Lifted First-Order Belief Propagation. In Proc. of the 23rd AAAI Conf. on Artificial Intelligence (AAAI-08), pages 1094–1099, July 13-17 2008. [Welling and Teh, 2003] M. Welling and Y.W. Teh. Linear response for approximate inference. In Proc. of NIPS- 03, pages 191–199, 2003. 18 The Smart-TSH-Finder: Crawling and Analyzing Tempo-Spatial Hotspots in Second Life Akram Al-Kouz , Ernesto William De Luca, Jan Clausen, Sahin Albayrak DAI-Labor, TU Berlin, 10587 Berlin, Ernst-Reuter-Platz {Akram, ernesto.deloca, Jan.clausen, sahin.albayrak}@dai-labor.de Abstract In this paper we introduce the Smart Tempo-Spatial Hot- spots Finder (Smart-TSH-Finder) for smart data crawling in the Second Life (SL) virtual world. Classical methods of crawling data from SL lead to irrelevant data content because of the dynamic nature of avatars and objects in SL. In order to build artificially intelligent expert avatar agents that are able to provide intelligent services for other typical avatars in virtual world we attempt to en- hance the quality of extracted data from SL. Based on experimental observation, avatars tend to gather in some places for different amounts of time, which forms tempo- ral and spatial hotspots. 
Utilizing the Tempo-Spatial char- acteristics of the avatars behavior in virtual worlds could improve the quality of the extracted data. Smart-TSH- Finder implements a Tempo-Spatial Hotspots finding mechanism to crawl dynamic contents such as chat con- versations from Second Life. The system introduces two mechanisms: the Tempo-Spatial Hotspots Detection, and the Tempo-Spatial Hotspots Prediction. Our smart chat conversations that have been crawled showed good en- hancement in content quality of the crawled chat conver- sation which will enrich the future textual analysis work. Additionally, we found that extracting avatars interactions and behavior in the Tempo-Spatial Hotspots in addition to chat conversations can help in generation a more coherent social network model for SL. 1 Introduction Second Life1 (SL) is a 3D virtual world developed by Linden Labs2, a free client program called the Viewer enables its users to interact with each other through avatars. Avatars can explore, meet, socialize, participate in individual and group activities. SL has monthly unique users with repeat logins pick of 826,214 in March 2010; this represents 7% growth over the high set in Q4 2009, and 13% growth over March 20093. The increasing popu- larity of SL as immersive virtual world made it an inter- esting target of research. In SL we have observed that avatars tend to gather in certain places (hotspots) in the land because of their sense to persistency. According to 1 www.secondlife.com 2 http://lindenlab.com 3 http://blogs.secondlife.com/community/blog/2010/04/28/econ omy-hits-new-all-time-high-in-q1-2010 the operative definition of presence of [Vives and Slater, 2005], in that we suppose avatars will behave in the same way in SL. In order to build smart expert avatar agents able to provide smart services for other typical avatars in virtual world we need to extract dynamic data contents from SL. A data crawler capable to extract the desired data is needed. But SL is a commercial product, we do not have access to statistical data, and it is unlikely that such data would be made public by Linden Labs. On the other hand SL puts restrictions on the amount and the nature of data that can be crawled. The land in SL is divided into Parcels which are subdivided in a 256x256 meter region, content is only presented to users as they move close to the content location, thus, user can extract the public chat log within the silence cone of 20 meters, added to his own behavior and interactions with objects in SL, so the crawler must dynamically model the environment based on the previously mentioned restrictions. We analyzed and utilized the Tempo-Spatial character- istics of the avatars behavior in SL to build Tempo-Spatial Hotspots finding system which contains smart agents (bots) able to extract dynamic contents such as chat con- versations from intelligently detected or predicted hot- spots in SL. Our goal was to determine trends in SL by analyzing the content of the extracted chat logs in future work to discover behavior patterns, conversation patterns, and conversation topics and to structure a social network in SL. Furthermore, evaluating how useful this informa- tion is in building user profile to be used in future work for recommendation or integration in our previous work done with the SpreeLand Expert Avatar [Al-Kouz et al., 2010]. We developed the proposed system based on the LibOpenMetaverse4 open-source software related to the metaverse and virtual worlds. 
While our system is com- patible with SL, it is also capable to integrate with other alternative virtual worlds in OpenSimulator5. In this work, we first present related work in the next section. Then, we introduce our system architecture and the related analysis mechanisms and algorithms used to build it, in the following section we discuss the empirical evaluation experiment. The paper concludes with a dis- cussion and future work. 4 LibOpenMetaverse http://www.libsecondlife.org 5 http://opensimulator.org/wiki/Main_Page 19 2 Related Works At the moment, there is no previous work to smart use of the Tempo-Spatial characteristics of the avatars behavior in SL and related virtual worlds for enhancing the quality of crawled data. However, there are some published stud- ies about utilizing both of the characteristics separately. Due to the lack of publicly chat data sets and the closed nature of SL as commercial product, the field has not yet gained great interest as opposed to the currently prevailing research on the typical Instant Messaging. Obviously, some research presents an intelligent agent crawler de- signed to collect user-generated content in SL. The agents navigate autonomously through the world to discover re- gions. It shows that virtual worlds can be effectively ex- tracted using autonomous agent crawlers that emulate normal user behavior [Eno et al., 2009]. A refined ap- proach presented analysis and visualization of the spatial and temporal distribution of user interaction data collected in 3D virtual worlds [Börner et al., 2001]. However it is not discussing its effect on the extracted data quality. A more sophisticated model developed a chat message analysis system called IMAnalysis using text mining techniques, which includes functions for chat message retrieval in general and not dedicated to 3D virtual worlds [Hui et al., 2008]. There has been some work on crawling and exploring virtual worlds for reasons other than con- tent collection [Varvello et al., 2008], [La and Pietro, 2008]. Even though, none of mentioned researches used Tempo-Spatial characteristics of SL to enhance the quality of crawled data. 3 Our Approach Based on our observation, SL avatars tend to gather in specific trendy spots for variant amount of time, taking this fact into consideration adding to the constraints on data crawling by Linden Labs, we find that the classical methods of crawling data from SL such as extracting data by a logged in automated bot into SL in specific location, or by making the bot to follow some avatars to imitate the natural behavior of the avatars, lead to irrelevant data con- tent, because of the dynamic nature of avatars and objects in SL. In our attempt to enhance the quality of extracted data from SL, a smart data crawler system based on the Tempo-Spatial analysis of the avatars behavior was de- veloped to crawl dynamic contents such as chat conversa- tions from SL. Our system is based on the client/server protocol of the open source LibOpenMetaverse library. 3.1 System Architecture As we have shown in Figure 1, the architecture of our proposed Smart-TSH-Finder system has two main mod- ules: the Virtual Environment Layer module and the Smart-TSH-Finder Crawler module. The first establishes the network communications with the virtual world. The last contains three sub modules: Login Manager, Heat Maps Manager, and Crawling Manager. 
Heat Maps Man- ager introduces two alternative smart mechanisms to find the trendiest hotspots in the virtual world suitable to crawl dynamic contents from: the Tempo-Spatial Hot Spots De- tection service, and the Tempo-Spatial Hot Spots Predic- tion service. The Virtual Environment Layer Module The Virtual Environment Layer Module is an intermediate network abstraction layer between Smart-TSH-Finder Crawler and the virtual world, it was built on top of Li- bOpenMetaverse library to customize some of the func- tionalities required by our Smart-TSH-Finder Crawler. LibOpenMetaverse runs on top of the .NET runtime and maintains compatibility with the Second Life protocol and can be used for creating clients and automatons in Second Life, OpenSimulator or other virtual worlds which use the Second Life Protocol. Figure1: Smart-TSH-Finder Architecture The Smart-TSH-Finder Module The Smart-TSH-Finder Module is a distributed artificial intelligence agent built on top the Virtual Environment. It is capable to connect multiple lands and extract dynamic contents such as chat conversations logs and some ava- tars’ behavior actions. Smart-TSH-Finder Crawler has three sub modules: the Login Manager sub module, the Heat Maps Manager sub module and the Crawling Man- ager sub module. The Login Manager Sub Module The Login Manager is capable to connect to multiple lands simultaneously, it handles the authentication proc- esses with the virtual environment, an active SL user cre- dentials are required. After login it stores all the standard retrieved data from the virtual world in the Login Reposi- tory as explained in Figure 1. The Heat Maps Manager Sub Module The Heat Maps Manager is the core of our system, it uses the data stored in the Login Repository to generate a spa- tial heat map for the currently logged in land as shown in Figure 2, taking into consideration the virtual world speci- fications, such as the size of the Parcel (256*256 meters) and silence cone (20 meters), it generates a 2D matrix corresponding to the spatially discovered hotspots. The generated 2D matrix is used as the base for the smart Tempo-Spatial analysis in the Detection Service and the Prediction Service to determine the trendiest spots that are suitable to extract informative data from. Detection Service The Detection Service implements the Detection Algo- rithm which will be discussed in the first part of the Smart-TSH-Finder Algorithms section to discover the trendy Tempo-Spatial spots. It will suggest the most fit- ting spots between temporal and spatial spots, added to that it has the capability to use a baized criteria specified by the user to retrieve hotspots temporally or spatially baized. The retrieved hotspots will be stored in the Hot- spots Repository. 20 Figure 2: Heat Maps Manager Screenshot. Prediction Service The Prediction Service implements the Prediction Algo- rithm which will be discussed in the second part of the Smart-TSH-Finder Algorithms section to predict the next trendy spatial spots based on time stamp. It will predict the trendiest spots based on a user selected time slice, and predict the next probable hotspot using linear regression. The predicted hotspots will be sorted in the Hotspots Re- pository. Crawling Manager Sub Module The Crawling Manager is responsible for management of the data crawling process. 
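As a concrete illustration of the default fitting criterion applied by the Detection Service (detailed in the Smart-TSH-Finder Algorithms section: a spot qualifies if both its visit frequency and its average visit time exceed the respective mean), the following sketch assumes that per-spot observations have already been collected by the Heat Maps Manager; the names and data structures are illustrative, not the actual implementation.

```python
# Illustrative sketch of the default Tempo-Spatial fitting criterion: a spot is a
# Tempo-Spatial Hotspot if both its visit frequency (spatial component) and its
# average visit time (temporal component) exceed the respective mean.
# Names and data structures are assumptions, not the actual implementation.
from statistics import mean

def detect_hotspots(spots):
    """spots: dict mapping a spot id (e.g. a grid cell of the 256x256 m parcel)
    to (visit_frequency, avg_visit_time_seconds)."""
    freq_mean = mean(f for f, _ in spots.values())
    time_mean = mean(t for _, t in spots.values())
    return [s for s, (f, t) in spots.items()
            if f > freq_mean and t > time_mean]   # default: logical AND of both means

# Example: four candidate spots observed by the Heat Maps Manager.
observations = {
    "spot_1": (124, 310.0),   # frequently visited, long average stays
    "spot_2": (24, 45.0),
    "spot_3": (75, 290.0),
    "spot_4": (2, 15.0),
}
print(detect_hotspots(observations))   # -> ['spot_1', 'spot_3']
```

Switching the fitting criterion to the median or standard deviation, or biasing it towards the temporal or spatial component, only changes the two threshold values.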
It uses the data stored in the Hotspots Repository to generate a smart agent for each hotspot, the generated smart agent is programmatically controlled bot, which has the capability to extract dynamic contents from the virtual world. Furthermore, it can han- dle some Smart Objects. Smart Object is a geometric ob- ject, such as cube or circle, manually prebuilt in a free land and attached to the user’s avatar, it has script able to response to some interactions and extracts some data. The extracted data will be stored in the Hotspots Repository for further analysis. 3.2 Smart-TSH-Finder Algorithms In this section we describe the algorithms used to analyze the Tempo-Spatial behavior of the avatars based on the spatial heat map for the currently logged in land generated by the Heat Maps Manager as shown in Figure 2. The corresponding 2D matrix will be used as the base for the smart Tempo-Spatial analysis mentioned in the Detection mechanism and the prediction mechanism to determine the hotspots that are suitable to extract data from. Hotspots Detection Algorithm In the following we present an algorithm by which we detect hotspots of the land. The detection algorithm is an unsupervised learning method to assign set of Tempo- Spatial observations which are the visits frequency of the location as spatial factor and the average visit time of the location as temporal factor. In our experiment of the dis- covered temporal and spatial hotspots exhibit a Pearson correlation coefficient of approximately 0.74, details are provided by Figure 3, where the x axis is a numeric repre- sentation of the hotspot, the left y axis represents the visits frequency and the right y axis represents the average visit time in second, the plotted curve with diamonds repre- sents the spatial values and the curve with circles repre- sent the temporal values. Figure 3: Tempo-Spatial Relation of Hotspots. Following we describe how Tempo-Spatial Hotspots are discovered. First the mean or median respectively for each coordinate is computed. If a spot has a value greater than the mean in each component, it is considered as a Tempo- Spatial Hotspot. By this we find the most suitable fitting between the temporal and spatial hotspots. Furthermore, the Detection Service has alternative statistical techniques to find fitting between the temporal and spatial hotspots leading to discovering the most trendy hotspots both tem- porally and spatially as shown in Figure 4, where the x axis is the average visit time per hotspot and the y axis is the visits frequency of that hotspot, the fitted hotspots are plotted as big circles, the others as small diamonds. Figure 4: Tempo-Spatial Fitted Hotspots. The default fitting criteria is the logical AND between the mean of the temporal hotspots and the mean of spatial hotspots, further the user can alternate between median and standard deviation as fitting criteria, fitted hotspots should be greater than the components means, median, or standard deviation respectively. On the other hand, the Detection Service will enable the user of the system to specify the biases into the temporal direction to crawl long chat conversations with low number of participants, or into the spatial direction to crawl long chat conversations with high number of participants. Hotspots Prediction Algorithm The Hotspots Prediction Algorithm uses the previously generated Heat Map to predict the coming hotspots in terms of time. 
It uses the linear regression analysis to ex- 21 pect the most top hotspots and its sequence in the next time slice. The spatial hotspots will be classified based on the user selected time slice generating a 2D matrix HS [i, j]. The first index i is a representation of the hotspot num- ber, the second index j is a representation of the accumu- lated time slices and the value at HS [i, j] is the visit fre- quency of the ith hotspot at the jth time slice as shown in the Table 5. HS/Time i1 i2 i3 i4 i5 i6 i7 j1 124 24 75 2 4 133 41 j2 93 53 55 4 3 30 8 j3 24 45 2 3 12 33 17 j4 53 13 12 0 0 18 0 j5 0 10 22 0 5 10 20 j6 0 0 0 0 12 12 0 j7 0 0 5 0 0 0 12 Table 5: Sample Hotspots Matrix The Hotspots Prediction Service generates a 3D surface representation of HS matrix, as shown in Figure 6 hot- spots reach their peak at different times. The system use the current time stamp to predict the next trendy spot by retrieving the correspondence visit frequency of all the hotspots in that time interval into two vectors, one to store the hotspot number and the other to store the correspond- ing visit frequency. The Linear Regression algorithm will be applied on the two vectors, hotspot with the least abso- lute value of residuals will be suggested as the next hot- spot. Figure 6: 3D Representation of the Sample Hotspots Matrix in Table 4. 4 Evaluation In order to evaluate our system, we compare the quality of the extracted data in both the classical methods and in our proposed method based on the mentioned Tempo-Spatial Hotspots Finding Algorithms. 4.1 Data Crawling We used Solace Cove Land6 to do the experiment, Solace Cove is a popular Second Life virtual space for meeting friends, doing business, and sharing knowledge. We ran 6 http://maps.secondlife.com/secondlife/Solace20Cove/121/81/24 our system in the mentioned land five times for thirty minutes every time. Our system found the Tempo-Spatial Hotspots and sent four smart agents to that spots to start data extraction. At the same time we logged a bot in Sol- ace Cove Land using classical LibObenMetaverse dashboard client and start extracting data by enforcing the bot to follow a specific avatar and record the chat log. In the previously mentioned two cases we fired up a specific discussion question “How was the FIFA world cup 2010 in South Africa” several times, and then we extracted ava- tars responses for thirty minutes. We got four chat log files from our system prefixes with hotspot number each one contains a different number of utterances and one chat log file from the classical dashboard prefixes with “DB” as shown in Table 7. 4.2 Data Preparation In order to compare the quality of the crawled data from our system with the quality of the crawled data from the classical dashboard, we applied a windowing mechanism to split chat log files into smaller files. We considered the window size of eight utterances, six utterances as the av- erage size of chat conversation based on [Ogura et al., 2004], one for pre overlapping and one for post overlap- ping. Utterances fit in each window are stored as a sepa- rate file prefixes with chat log file name and window number as in Table 7. Chat Log Files No. of Utterances No. 
Of Windows HS1ChatLog 213 35 HS2ChatLog 322 53 HS3ChatLog 105 17 HS4ChatLog 96 16 DBChatLog 386 64 Table 7: Sample Extracted Chat Log files and its Windows 4.3 Evaluation Process In our evaluation process we considered the previously fired question “How was the FIFA world cup 2010 in South Africa” as query, and we used a ranking function based on summing the TF–IDF7 weight (term frequency– inverse document frequency) for each query term to evaluate how important that term is to a document in a collection or corpus. In this case each window file consid- ered as a document and all the windows corresponding to one chat log file considered as corpus. To normalize the results we evaluated the first windows as the number of windows in chat log file of the lowest number of win- dows. We used the algorithm in Figure 8 to rank each chat log file. Chat log files with higher rank are the most rele- vant to our query. 4.4 Evaluation Results We conducted five runs of our experiment and took the average results to generalize observations. Average results show that three out of the four hotspots chat log files (HS1ChatLog, HS3ChatLog and HS4ChatLog) which were extracted by our system have higher rank than the dashboard chat log file “DBChatLog” which extracted by the classical method. Thus, chat log file “HS2ChatLog” 7 http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf- weighting-1.html 22 has rank lass than but close to the rank of dashboard chat log file “DBChatLog”. Further analysis of the results and extracted data has been done to discover that dashboard chat log file “DBChatLog” beyond chat log file “HS2ChatLog” in rank because the followed avatar used in the classical way hanged in one place for long time and that place was selected by our system as a hotspot, so our extracted data at this hotspot is a subset of the “DBChat- Log” file which explains why we got low rank in this hot- spot. Beyond that our system showed high ranking for the extracted data from the discovered hotspots which means our extracted data is more relevant and coherent. Figure 8: Chat Log File Ranking Algorithm 5 Conclusion and Future Work A Smart-TSH-Finder for Second Life has been developed as a substitute to the classical methods of crawling data from SL. The proposed system enhance the quality of the extracted data, which considered a major step in building artificially intelligent expert avatar agents able to provide smart services for other typical avatars in virtual world. Smart-TSH-Finder Crawler utilized the Tempo-Spatial characteristics of the avatars behavior in virtual worlds to improve the quality of the extracted data to discover the temporal and spatial hotspots that avatars tend to gather in. The system introduces two mechanisms: the Tempo- Spatial Hotspots Detection, and the Tempo-Spatial Hot- spots Prediction. System user have the choice to use the former or the last mechanism to determine the most trendy hotspots, after that the system creates an agent for each trendy hotspot and send it to that hotspot, the generated agent starts crawling data from its desired hotspot. The empirical evaluation experiment of the crawled chat conversations showed good enhancement in content quality which enriches the future textual analysis work to discover behavioral patterns, conversation patterns, and conversation topics. Additionally, we found that extract- ing avatars interactions and behavior in the Tempo-Spatial Hotspots in addition to chat conversations can help in modeling a social network for Second Life. 
Furthermore, evaluating how useful this information is in building user profile to be used in future work for recommendation or integration with our previous work done with our previous work. References [Vives and Slater, 2005] Sanchez Vives, M.V., Slater, M, From presence to consciousness through virtual reality. Nature Reviews Neuroscience 6(4), 332–339, 2005. [Al-Kouz et al., 2010] Akram Al-Kouz, Winfried Um- brath, Ernesto William De Luca, Sahin Albayrak, SpreeLand: An Expert Avatar Solution for Virtual World E-Learning Environments, the International Conference on E-Learning in the Workplace, 2010. For each chat log file For each window ChatLogFileRank += WindowRank ∑ ×= nTermiTerm IDFTFWindowRank Terms= “word”, ”cup”, “2010”,”south”,”africa” iTerm: First entry in Terms nTerm: Last entry in Terms TW TTTF = TT: No. of times term appears in window TW: No. of words in window ⎟⎠ ⎞⎜⎝ ⎛= WT WLogIDF W: No. of windows in chat log file WT: No. of the windows the term appeared in [Eno et al., 2009] Joshua Eno, Susan Gauch, Craig Thompson, Intelligent Crawling in Virtual Worlds IEEE/WIC/ACM International Conference on Web In- telligence and Intelligent Agent Technology – Work- shops, 2009. [Börner et al., 2001] Katy Börner, William R. Hazlewood & Sy-Miaw Lin, Visualizing the Spatial and Temporal Distribution of User Interaction Data Collected in Three-Dimensional Virtual Worlds, 2001. [Hui et al., 2008] Hui, S.C. , Yulan He , Haichao Dong , Text Mining for Chat Message Analysis, IEEE Confer- ence on Cybernetics and Intelligent Systems, 2008. [Varvello et al., 2008] M. Varvello, F. Picconi, C. Diot, and E. Bier sack, Is there life in Second Life?, Thom- son Technical Report CR-PRL-2008-07-0002, 2008. [La and Pietro, 2008] C-A. La and M. Pietro, Characteriz- ing user mobility in Second Life, Technical Report RR- 08-212, Institut Eurecom, 2008. [Ogura et al., 2004] Kanayo Ogura, Masato Ishizaki and Kazushi Nishimoto, A Method of Extracting Topic Threads Towards Facilitating Knowledge Creation in Chat Conversations, Knowledge-Based Intelligent In- formation and Engineering Systems, 2004. 23 Stream-based Community Discovery via Relational Hypergraph Factorization on Evolving Networks Christian Bockermann and Felix Jungermann Technical University of Dortmund, Artificial Intelligence Group Baroper Strasse 301, Dortmund, Germany {bockermann,jungermann}@ls8.cs.tu-dortmund.de Abstract The discovery of communities or interrelations in social networks has become an important area of research. The increasing amount of informa- tion available in these networks and its decreas- ing life-time poses tight constraints on the infor- mation processing – storage of the data is often prohibited due to its sheer volume. In this paper we adapt a flexible approach for community discovery offering the integration of new information into the model. The continu- ous integration is combined with a time-based weighting of the data allowing for disposing ob- solete information from the model building pro- cess. We demonstrate the usefulness of our approach by applying it on the popular Twitter network. The proposed solution can be directly fed with streaming data from Twitter, providing an up-to- date community model. 1 Introduction Social networks like Twitter or Facebook have recently gained a lot of interest in data analysis. A social net- work basically consists of various types of entities – such as users, keywords or resources – which are in some way related to one another. 
A central question is often the dis- covery of groups of individuals within such networks - the finding of communities. Thus, we are seeking for a clus- tering of the set of entities into subsets where the individ- uals within each subset are most similar to each other and are most dissimilar to the entities of all other subsets. The similarity of entities is provided by their relations to one another. The relations between different entities are implied by the communication taking place within the network. Users exchange messages, which contain references to other users, are tagged with keywords or link to external re- sources by means of URLs. Figure 1 shows a message from the Twitter platform, which implies relations between the user yarapavan, the URL http://j.mp/fpga-mr and the tag #ML. A natural perception of a social network is that of a connected graph, which models each entity as a node and contains (weighted) edges between related entities. Such a graph can be easily described by its adjacency matrix: with d being the number of entities in our social network, we will end up with a (sparse) matrix A of size d2, where Ai,j = w if entity i is related to j with weight w and 0 Figure 1: Example tweets of the Twitter platform otherwise. However this representation is not well-suited for n-ary relations. A well-established representation of multi-dimensional relations is given by tensors [1; 2; 5; 17; 12; 6; 19]. A tensor is a multi-way array and can be seen as a generalization of a matrix. Tensors have been successfully used in multi- dimensional analysis and recently gained attention in social network mining [1; 2; 5]. In the case of social networks, tensors can be used to describe n-ary relations by using one tensor for each type of relations. Ternary relations of type (user,tag,url) can then be described by a mode-3 tensor X with X i,j,k = { w if user i, tag j and url k are related 0 otherwise. More complex n-ary relations will be reflected in tensors of mode-n. Tensor based Community Discovery Community discovery in such tensor representations is mapped to a decomposition of the tensors into a product of matrices U (i) ∈ Rmi×k which approximates the tensor X ≈ [z] ∏ i ×diU (i). Each of the matrices U (i) in turn reflects a mapping of entities to clusters {1, . . . , k}. The [z] factor is a super- diagonal tensor which serves as a “glue element” – see Section 3 for details. A variety of different decomposi- tion techniques such as Tucker3 or PARAFAC (CP) has been previously proposed [3; 14; 7]. Approximation is com- monly measured by some divergence function. In [5] the authors proposed a clustering framework based on tensor decompositions which has been generalized for Bregman divergences. In [4] Bader et al. used CP tensor decom- position to detect and track informative discussions from the Enron email dataset by working on the ternary relation (term,author,time). These approaches have been applied to decompose single tensors. In [16] the authors introduced METAFAC, which is a factorization of a set of tensors with shared factors (U (q) matrices). This allows for the discov- ery of one global clustering based on multiple tensor de- scriptions of the data. The time complexity for these tensor 25 decompositions is generally given by the number of non- zero elements of the tensors (provided that a sparse repre- sentation is used). Stream-based Community Discovery The majority of the tensor decomposition methods so far is based on a static data set. 
To incorporate streaming data, the stream is broken down into blocks and the decomposi- tions are re-computed for each of the new blocks [16]. A common way to handle time is to introduce a trade-off fac- tor of the old data and the data contained in the new blocks. In [18] Sun et al. presented dynamic tensor analysis. They handle n-ary relations by tensor decomposition using stream-based approximations of correlation matrices. They also presented a stream-based approach which is not really comparable to ours. They are processing a tensor contain- ing data by unfolding the tensors to every single mode and after that they are handling every column of the resulting matrices in a stream to update their model. In reality, we cannot assume such an original tensor to be given. In con- trast to [18], we consider multiple relations which have to be updated at each iteration instead of just one. Contributions The critical bottle-neck within the tensor decomposition methods often is their runtime. As of [16], the runtime for a decomposition of a set of tensors can be bound by O(N), where N is the number of entries in all tensors. However, this number can be rather large – we extracted about 590k entries (relations) from 200k messages of the Twitter plat- form. In this work, we present an adaption of the METAFAC framework proposed in [16]. Our contributions are as fol- lows: 1. We integrate a sampling strategy into the METAFAC framework. Effectively we limit the maximum size of the tensors – and therefore N – and use a least- recently-used approach to replace old entities if the limit of an entity type exceeds. 2. We introduce a time-based weighting for relations contained within the tensors. These weights will de- crease over time, reflecting the decreasing importance of links within the social networks. 3. We present an adaption of the METAFAC factorization which allows for a continuous integration of new rela- tions into the factorization model. Instead of running the optimization in a per-block mode, we provide a way to simultaneously optimize the model while new data arrives. 4. Finally, we provide an evaluation of our proposed adaptions on real-world data. The rest of this paper is structured as follows: Section 2 formalizes the problem and presents the METAFAC ap- proach on which this work is based. Following that, we give an overview of tensor decomposition in Section 3 and provide the basics for the multilinear algebra terminology required. In Section 4 we introduce our stream-based adap- tion of the METAFAC algorithm. We evaluated our stream- ing approach on real world data (Section 5) and present our findings in Section 6. 2 Multi-Relational Graphs As denoted above, a social network generally consists of a set of related entities. In general, we are given sets V1, . . . , Vk of entities of different types, such as users, key- words or urls. Let Vi be the i-th type of entities, e.g. V1 corresponds to users, V2 refers to keywords and so on. A relation then is a tuple of entities, e.g. a user-keyword re- lation (u1, k1) is an element of V1 × V2. We also refer to R := V1×V2 as the relation typeR of the relation (u1, k1). The entities are given as strings, and we define a map- ping ϕi for each entity type Vi, which maps entities to in- tegers ϕi : Vi → {0, . . . , |Vi| − 1}. The mapping ϕi can be some arbitrary bijective function. For some w ∈ Vi we refer to ϕi(w) as the index of w. We denote the string of an entity given by its index j by ϕ−1(j). 
This allows us to identify each entity by its index and enables us to describe a set of relations between entities by a tensor. A tensor X is a generalization of a matrix and can be seen as a higher-order matrix. A mode-k tensor X ∈ RI1×...×Ik is a schema with k dimensions where X i1,...,ik ∈ R, ij ∈ Ij denotes the entry at position (i1, . . . , ik). For k = 2 this directly corresponds to a simple matrix whereas k = 3 is a cube. With the mappings ϕi of entities and the tensor schema, a set of relations X ⊆ Vi1 × . . . × Vil(i) can be defined as a mode-k tensor X ∈ R|Vi1 |···|Vil(i) | with X ν1,...,νl(i) = { 1 if (ϕ−11 (ν1), . . . , ϕ −1 l(i)(νl(i))) ∈ X 0 otherwise (1) where ϕ−1i (νi) denotes the mapping ϕi that corresponds to the i-th relation type, and Vi1 × . . .× Vil(i) are the indexes of the entity types used in the relation i. 2.1 MetaGraph Following the above approach for k = 2, we would be con- sidering only binary relations, which correspond to edges in the graph representation of the social network. Thus the adjacency matrix for such a graph would be resembled within a collection of mode-2 tensors. MetaGraph introduced by [16] is a relational hypergraph representing multi-dimensional data in a network of enti- ties. A MetaGraph is defined as a graphG = (V,E), where each vertex corresponds to a set of entities of the same type and each edge is defined as a hyper-edge connecting two or more vertices. By the use of hyper-edges, the MetaGraph captures multi-dimensional relations of the social network and therefore provides a framework to model n-ary rela- tions. Given the notion of relation types defined above, each relation type Ri = Vi1 × . . . × Vil(i) corresponds to a hyper-edge in the MetaGraph G. Each relation type Ri = (vi1 , . . . , vil) observed within the social network is reflected in a hyper-edge of the MetaGraph. Given a fixed set of relation types R1, . . . , Rn, we can model the occur- rence of relations of type Ri by defining a Tensor X (i) for each Ri as described in (1). This approach results in a description of the social net- work by means of different relational aspects R1, . . . , Rn. Each type Ri of relations for which a tensor is defined, re- flects a subset of all the relations of the network. Capturing the complete set of relations among all entities would ob- viously result in |P(V )| = 2|V | different tensors. 26 user tag tweet url R1 R2 R3 R4 R5 R6 R7 R8 Figure 2: MetaGraph for Twitter Figure 2 visualizes the metagraph we used for mod- eling possible relations in the microblogging framework Twitter. As an example, the relation R8 referring to (user,tweet,tag,url) is represented by a hyper-edge connect- ing four vertices. 2.2 Community Discovery Problem With the use of tensors we have an approximated descrip- tion of our social networks by means of a set of relation typesR1, . . . , Rn. Thus we can describe our network graph G by means of the data tensors which are defined according to the observed relations R1, . . . , Rn in G, i.e. G 7→ { X (1), . . . ,X (n) } . Based on this description we seek for a further partitioning of the tensor representation into clusters of entities. The solution proposed in [16] is a factorization of the tensors X (i) into products of matrices U (q) which share a global factor [z] and some of the U (q) matrices. Let X (i) be the tensor describing Vi1 × . . . × Vil(i) , then we can factorize this as X (i) ≈ [z] l(i)∏ j=1 ×jU (ij). 
(2) Within this factorization, the [z] factor is a super- diagonal tensor containing non-zero values only at posi- tions (i, i, . . . , i). The U (q) are R|Vq|×k matrices, where |Vq| is the number of entities of the q-th entity type and k is the number of communities we are looking for. For ten- sors which relate to relation types with overlapping entity types (e.g. (user,keyword,tag) and (user,keyword,url)) the corresponding factorizations share the related U (q) matri- ces (e.g. Uuser and Ukeyword). The ×j is the mode-j product of a tensor with a matrix. With an appropriate normalization as used in [16], the U (q) matrices only contain values of [0, 1] which can be interpreted as probability values. Based on this, the value of U (q)l,m can be seen as the probability of entity ϕ −1 q (l) belonging to cluster m ∈ {1, . . . , k} and we can simply map an entity to its cluster C(l) by C(l) = argmax m U (q) l,m. (3) Thus, the community discovery is mapped onto the si- multaneous factorization of a set of tensors. The objective is to find a factorized representation, which resembles the original data tensors {X (i)} as closely as possible. Given some distance measure D : RI1×...×Il × RI1×...×Il → R this leads to the following optimization problem: argmin [z],{U(q)} n∑ i=1 D(X (i), [z] l(i)∏ j=1 ×jU (ij)) (4) 2.3 Batch Processing of Evolving Tensors To capture the evolving behavior of the data tensors Lin et.al. proposed a batch processing approach. The stream is processed as disjoint sliding windows. Let t denote the current window and denote by zt−1, {U (q)t−1} the model ob- tained so far. Then the time-based optimization problem of [16] yielding the new model zt, {U (q)t } is given as argmin zt,{U(q)t } (1− α) n∑ i=1 D(X (i)[zt] l(i)∏ j=1 ×jU (i)t ) + αLprior (5) Lprior = D([zt−1][zt]) + n∑ i=1 D(U (i) t−1U (i) t ). (6) The Lprior reflects the divergence between the new and the old model, whereas the α specifies a trade-off between the models. 3 Tensors & Tensor Factorizations The previous sections presented tensors as a mathemati- cal structure to model multi-dimensional relations and mo- tivated their use to describe multi-relational data such as community networks. In this section we will introduce ten- sors and their factorizations in more detail. 3.1 Tensor Decomposition Tensors can be decomposed into a sum of component rank- one tensors [11]. A popular method for factorizing a tensor into a product of matrices is the PARAFAC decomposition (CP-decomposition) by [9]. Using CP, a third-order tensor X ∈ RI×J×K can be decomposed to X = [ R∑ r=1 ar ◦ br ◦ cr ] + E (7) where R is a positive integer and ar ∈ RI , br ∈ RJ , cr ∈ RK . See Figure 3 for a schema of this decomposition. This decomposition is an approximation of X by an error of E . Neglecting this error tensor E we are left with the ap- proximation of X by X ≈ R∑ r=1 ar ◦ br ◦ cr. (8) 27 ≈+...+ + +a1 b1c1 a2 b2c2 a3 b3c3 aR bRcR X Figure 3: CP tensor decomposition X U2 U1 U3 ≈ z Figure 4: CP tensor decomposition incorporating weights By breaking equation (8) further down into its elementwise form, we get xijk ≈ R∑ r=1 airbjrckr (9) Since we later refer to the elements of this sum as prob- ability values, we need to normalize the rank-one tensors ar, br and cr for r = 1, ..., R to length one. With this normalization, we are left with the following form xijk ≈ R∑ r=1 zrairbjrckr (10) where z ∈ RR is a weight-vector. This decomposition sometimes is called higher-order singular value decompo- sition (HOSVD) ([10]). 
The rank-one tensors ar, br and cr for r = 1, ..., R represent the singular values and can be used to derive a clustering of the data. ar, br and cr for r = 1, ..., R can be combined in ma- trices U1 ∈ RI×R, U2 ∈ RJ×R and U3 ∈ RK×R. We construct a super-diagonal tensor [z] ∈ RR×...×R of z con- taining just zeros apart from positions zk,...,k where it con- tains value zk. This allows us to write eq. (10) as n-mode product (see [11] for further information about tensor cal- culation): X ≈ ((([z]×1 U1)×2 U2)×3 U3) = [z] 3∏ i=1 ×iU i (11) Figure 4 visualizes the n-mode product for better under- standing. 3.2 METAFAC - Metagraph Factorization As mentioned before, the Metagraph is a description of a multi-relational graph G by means of a set of tensors {X (i)}. The objective of the METAFAC algorithm is to de- rive tensor decompositions of the X (i) with shared factors [z], U (q) which closely resemble the X (i). To measure the approximation, [16] proposed the Kullback Leibler diver- gence DKL [13], thus implying the following optimization problem: argmin [z],{U(q)} n∑ i=1 DKL(X (i), [z] ∏ i1,...,il(i) ×jU (ij)) (12) To solve for (12) the authors derived an approximation scheme by defining µ(i) = vec(X (i) ([z] l(i)∏ j=1 ×jU (ij))) (13) S(i) = fold(µ(i) ∗ (z ∗UMi ∗ · · · ∗U1i)T ) (14) where is the elementwise division of tensors, and ∗ is the Khatri-Rao product of matrices. These values are then be used to update z and the {U (q)} iteratively using z = 1 n n∑ i=1 acc(S(i),Mi + 1) (15) U q = ∑ l:el∼vq acc(S(i), q,Me + 1) (16) where acc is the accumulation-function of tensors and Mi + 1 is the last mode of tensor S(i). This update is car- ried out iteratively until the the sum in 4 converges. The optimization is shown in Algorithm 1. Algorithm 1 MF algorithm Input: Meta-Graph G = (V,E), data tensors {X (e)} Output: z and {Uq} procedure METAFAC( G, {X (e)}) Initialize z, {Uq} repeat for all e ∈ E do compute {S(e)} by eq. (13), (14) compute z by eq. (15) end for for all q ∈ V do update {Uq} by eq. (16) end for until convergence end procedure The batched version of the METAFAC approximation can be derived by using the KL-divergence in equations (5),(6). An appropriate approximation scheme has been proposed by the following update function: z = (1− α) n∑ i=1 acc(S(i),Mi + 1) + αzt−1 (17) U (q) = (1− α) ∑ l:el∼vq acc(S(j), q,Mi + 1) + αU (q)t−1 (18) 4 Stream-based Community Discovery with Tensors In this section we present our adaptions of the METAFAC framework by introducing a sampling-based tensor repre- sentation of graphs and using time-stamped relations to in- duce a decrease of impact of relations to reflect the decreas- ing importance of Twitter messages. 28 Given a social network we are provided with a sequence M of messages M := 〈m0,m1, . . .〉 where each message mi implies a set of relations R(mi). Let τ(mi) ≥ 0 be the arrival time of mi. This results in an overall sequence of relations S := 〈R(m0),R(m1), . . .〉 which are continuously added to the evolving social net- work graph G. Hence we are faced with a sequence 〈Gt0 , Gt1 , . . .〉 of graphs where each Gti contains the relations of all mes- sages up to time ti. Let t, t′ be points in time with t < t′. In the following we will byG[t,t′] denote the graph implied by only the mes- sages of time-span [t, t′], hence Gt = G[0,t]. Accordingly the graphs are represented by the corresponding tensor as G[t,t′] 7→ { X (1), . . . ,X (n) } [t,t′] . 
4.1 The MFSTREAM Algorithm The METAFAC approach uses a sliding window of some fixed window sizews to manage streams. Given a sequence of time-points tj for j ∈ N with tj = tj−1 + ws, it factor- izes {X (i)}[tj−1,tj ] based on a trade-off factor α as denoted in equation (5). Our MFSTREAM algorithm interleaves the optimization of METAFAC by adding new relations during optimization and uses a time-based weighting function to take into ac- count the relations’ decreasing importance. Additionally, the optimization is carried out over only a partial set of re- lations as older relations tend to become obsolete for ad- justing the model. We will present the time-based weight- ing and the sampling strategy in the following and present the complete algorithm in 4.4. 4.2 Time-based Relation Weighting So far we considered the property of two or more entities to be related as binary property, i.e. if entities i, j and k are related, then X i,j,k = w, with w ∈ {0, 1}. With the extraction of relations from time-stamped messages – as provided within the Twitter platform – we are interested in incorporating the age of these relations to reflect the decreasing up-to-dateness of the information. Hence we associate each relation r ∈ Ri with a times- tamp τ(r) of the time at which this relation has been cre- ated (i.e. the time of the message from which it has been extracted). With S being a set of relations extracted from messages this leaves us with the tensor representation of relation type Ri = Vi1 × . . .× Vil(i) as X (i)i1,...,il(i) ={ τ(r) if r = (ϕ−11 (ν1), . . . , ϕ −1 l(i)(νl(i))) ∈ S 0 otherwise (19) In addition to that, we introduce a global clock, denoted by τmax, which represents the largest (i.e. the most recent) timestamp of all relations observed so far: τmax := max { τ(r) | r ∈ X } . Storing the timestamp τ(r) for each entry r in the tensors allows us to define a weighting function for the relations based on the global clock value. A simple example for a parametrized weighting function is given as ωα,β(r) := α α+ 1β (τmax − τ(r)) . (20) 4.3 Sampling The runtime of each iteration of the approximation scheme is basically manifested by the maximum numberN of non- zero entries in the tensors. To reduce the overall optimiza- tion time, we restrict the size of the tensors, i.e. number of entities of each type, by introducing constants Cq ∈ N and providing new entity mappings ϕq by ϕq : Vq → V q with V q = {1, . . . , Cq}. This has two implications: Clearly, these ϕq will not be bijective anymore if |Vq| > Cq . Moreover, the size of the U (q) matrices will also be limited to Cq × k. We deal with these imposed restrictions by defining dy- namic entity mappings ϕq , which maps a new entity e (i.e. an entity that has not been mapped before) to the next free integer of {1, . . . , Cq}. If no such element exists, we choose f ∈ {1, . . . , Cq} as the element that has longest been inactive, i.e. not been mapped to by ϕq . The rela- tions affected by f are then removed from all tensors and the current cluster model, i.e. U (q)f,i = 1 k ∀ i = 1, . . . , k. Effectively this introduces a “current window” {X (i)} of relations, that affect the adaption of the clustering in the next iteration. In contrast to the original METAFAC ap- proach this also frees us from having to know the number of entities and a mapping of the entities beforehand. 4.4 Continuous Integration With the prerequisites of section 4.2 and 4.3 we now present our stream adaption MFSTREAM as Algorithm 2. 
MFSTREAM is a purely dynamic approach of METAFAC which adds new relations to the data tensors {X (q)} and fits the model [z], {U (q)} after a specified number of T mes- sages. This differentiates our approach from METAFAC as the optimization is performed by running a single iter- ation of the optimization loop – with respect to the time- based weighting – after adding the relations of T messages to the tensors. The time complexity per iteration of the MFSTREAM algorithm is the same as for the METAFAC algorithms (see Section 3.2). Due to the fixed tensor di- mensions, the maximum number of non-zero elements N is constant, which implies O(1) runtime. 5 Evaluation For the evaluation of our approach we extracted relations of the Twitter website. Twitter is a blogging platform giving users the opportunity to inform other users by very small snippets of text containing a maximum of 140 characters. In spite of such limitations users are not only posting mes- sages – called tweets – but also enriching their tweets by tags, urls or mentions, which allows users to address other users. This brings up the entity types {user, tweet, tag, url }. To discover clusters on the above mentioned entities present on the Twitter platform, we constructed a meta- graph for Twitter as shown in Figure 2. The entity types 29 Algorithm 2 The MFSTREAM algorithm. 1: Input: MetaGraph G = (V,E), Stream M = 〈mi〉, capacities Cq , number of clusters k, constant T ∈ N 2: procedure MFSTREAM 3: Initialize z, {U (q)}, c := 0 4: while M 6= ∅ do 5: m := mc, c := c+ 1 . Pick the next message from the stream 6: for all (rj1 , . . . , rjl(j)) ∈ R(m) do 7: for all p = 1, . . . , l(j) do 8: if ϕp(rjp) = nil then . Replacement needed? 9: if |ϕp| = Cq + 1 then 10: f∗ := argminf∈ϕp τ(f) 11: U (p)f∗,s := 1 k ∀ s = 1, . . . , k 12: else 13: f∗ := min f∈{1,...,Cp} ϕ−1p (f) = nil . Pick next unmapped f ∗ 14: end if 15: ϕp(rjp) := f ∗, τ(f∗) := τ(m) 16: end if 17: end for 18: end for 19: νi := ϕp(rji) for i = 1, . . . , l(j) 20: X (p)ν1,...,νl(j) := τ(m) . Update corresponding tensor 21: if c ≡ 0 mod T then . Single opt.-iteration every T steps 22: for all i ∈ {1, . . . , n} do 23: compute {S(i)} by eq. (14) and (13) 24: update z by eq. (15) 25: end for 26: for all j ∈ {1, . . . , q} do 27: update {U (j)} by eq. (16) 28: end for 29: end if 30: end while 31: end procedure • R1: a user writing a tweet. • R2: a user writing a tweet containing a special tag. • R3: a user writing a tweet containing a special url. • R4: a user mentioning another user in a written tweet. • R5: a user writes a tweet containing a tag and an url. • R6: a user writing a tweet containing an url and mentioning another user. • R7: a user writes a tweet containing a tag and a mentioned user. • R8: a user mentioning another user in a tweet containing a tag and an url. themselves imply as much as P(V ) = 24 possible rela- tion types, some of which will not arise or are redundant. E.g. since each tweet is written by a user, there is no rela- tion (tweet,tag) which does not also refer to a user. Hence, our MetaGraph is based on the relation types {R1, . . . , R8} given as: We extracted 1000 seed users and their direct friends and followers. Followers are following a user which means that messages of the user are directly visible for the followers at their twitter website. Friends are all the users a partic- ular user is following. 
We used an English stopword filter to extract users which are writing in English language and processed all friends and followers of the seed users, re- vealing about 478.000 users. For these, we extracted all the messages written between the 19th and 23rd of Febru- ary 2010. Out of these 2.274.000 tweets we used the tweets written at the 19th of February for our experiments, leaving about 389.000 tweets from 41.000 users. 5.1 Evaluating the Model For a comparison of the clusterings produced by MF- STREAM and the METAFAC approaches we employ the “within cluster” point scatter [8]. This is given as W (C) := K∑ k=1 Nk ∑ C(i)=k ‖ xi − xk ‖2 (21) where K is the number of clusters, xi is a member of a cluster C(i) and xk is the centroid of a cluster k. It can be seen as a sum of dissimilarities between elements in the particular clusters. We created clusterings on a stream of 200k messages with MFSTREAM and restricted the tensor dimensions to Cuser = Ctweet = 5000 and Ctag = Curl = 1000. We employed several weighting functions such as ω1,1, ω10,1000 and ω100,1000 as well as a binary weighting which equals the unweighted model (i.e. w ∈ {0, 1}). To be able to compare the clusterings of MFSTREAM and METAFAC, we processed messages until the first en- tity type Vi reached its limit and stored the resulting clus- tering on disk. Then we reset the ϕ mappings and started anew, revealing a new clustering every time an entity type i reached Ci, revealing a total of 93 clusterings. We applied METAFAC on the messages that have been used to create these 93 clusterings and computed their similarities using W (C). Figure 5 shows that MFSTREAM delivers results comparable to METAFAC for different weighting functions. Figure 7 shows that using timestamped values instead of bi- nary values for calculation of the MFSTREAM delivers bet- ter results. The decrease of T , which implies a larger num- ber of optimization steps, intuitively increases the quality 30 Weighting W (C) (mean) std. deviation METAFAC 5.685 · 107 1.32 · 107 binary 7.511 · 107 2.00 · 107 ω1,1 6.142 · 107 1.57 · 107 ω1,1000 6.002 · 107 1.38 · 107 ω10,1000 6.272 · 107 1.43 · 107 ω100,1000 6.724 · 107 1.55 · 107 Figure 5: MeanW (C) of different weights (T = 20), com- paring MFSTREAM and METAFAC T W (C) (mean) std. deviation 5 6.043 · 107 1.35 · 107 10 6.133 · 107 1.36 · 107 50 6.058 · 107 1.40 · 107 100 6.465 · 107 1.49 · 107 250 10.882 · 107 3.13 · 107 500 43.419 · 107 11.47 · 107 Figure 6: Mean of W (C) with different update steppings T (weight used: ω1,1000) Figure 7: W (C) for MFSTREAM compared to the METAFAC clusterings. Figure 8: Relative W (C) of MFSTREAM using different update step sizes T of MFSTREAM as is attested by Figures 6 and 8. In addition, we made experiments to show the effect of the update frequency T . Figure 9 shows the relative run- time of MFSTREAM where T = 1 corresponds to the base- line at 1.0. Raising T results in shorter runtime, since the model is updated less frequently, which is the major time factor. The upper curve shows the runtime for updating af- ter 5 relations (T = 5), the middle one shows T = 10, and the latter refers to T = 50. Varying sizes of entity types by the Cq results in cluster- ings of different numbers of entities, which cannot be di- rectly compared by W (C). Hence, we normalized W (C) by the variance V of each clustering. Larger models of course incorporate more information, which results in more stable clusterings as can be seen in Figure 10. 
6 Conclusion and Future Work In this work we presented MFSTREAM, a flexible algo- rithm for clustering multi-relational data from evolving net- works, derived from the METAFAC framework by [16]. The main improvement of our approach is the reduction of the approximation scheme on to a small relevant win- dow of relations. The proposed time-based weighting of relations contributes to this reduction by removing obso- lete information that is not relevant to the model adaption anymore. MFSTREAM is able to handle relations containing new, unseen entities by offering a replacement strategy for the set of entities considered at optimization time. This makes it especially suitable to continuously integrate new data from a stream. We evaluated MFSTREAM on real-world data crawled from the Twitter platform and showed its com- parability to METAFAC. The use of backend storage for off-loading obsolete data that can be re-imported into the optimization window at a later stage might be an interesting advancement. Also, con- current criteria runtime and quality offer a starting point for multi-objective optimization. Additionally, recent works [15] motivate further improvements to handle a dynamic number k of clusters within MFSTREAM. References [1] E. Acar, S. A. Çamtepe, M. S. Krishnamoorthy, and B. Yener. Modeling and multiway analysis of chat- room tensors. In ISI, pages 256–268, 2005. [2] E. Acar, S. A. Çamtepe, and B. Yener. Collective sam- pling and analysis of high order tensors for chatroom communications. In ISI, pages 213–224, 2006. [3] J. F. Allen. Towards a general theory of action and time. Artificial Intelligence, 23:123–154, 1984. [4] B. Bader, M. W. Berry, and M. Browne. Survey of Text Mining II, chapter Discussion tracking in Enron email using PARAFAC, pages 147–163. Springer, 2007. [5] A. Banerjee, S. Basu, and S. Merugu. Multi-way clus- tering on relation graphs. In SDM. SIAM, 2007. [6] D. Cai, X. He, and J. Han. Tensor space model for document analysis. In E. N. Efthimiadis, S. T. Du- mais, D. Hawking, and K. Järvelin, editors, SIGIR, pages 625–626. ACM, 2006. [7] R. Harshman. Foundations of the parafac procedure: Models and conditions for an “explanatory” multi- 31 Figure 9: Relative runtime of MFSTREAM using different numbers of relations for update. Figure 10: W (C)/V of MFSTREAM using different sizes of models. modal factor analysis. UCLA Working Papers in Pho- netics, 16, 1970. [8] T. Hastie, R. Tibshirani, and F. J. The elements of statistical learning-data mining, inference and predic- tion. Springer, Berlin Heidelberg New York, 2001. [9] H. Kiers. Towards a standardized notation and termi- nology in multiway analysis. Journal of Chemomet- rics, 14(3):105–122, 2000. [10] E. Kilmer and C. D. Moravitz Martin. Decomposing a tensor. SIAM News, 37(9), 2004. [11] T. G. Kolda and B. W. Bader. Tensor decomposi- tions and applications. SIAM Review, 51(3):455–500, September 2009. [12] T. G. Kolda, B. W. Bader, and J. P. Kenny. Higher- order web link analysis using multilinear algebra. In ICDM ’05: Proceedings of the Fifth IEEE Interna- tional Conference on Data Mining, pages 242–249, 2005. [13] S. Kullback and R. Leibler. On information and suffi- ciency. Annals of Mathematical Statistics, 22:79–86, 1951. [14] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000. [15] Y.-R. Lin, J. Sun, N. Cao, and S. Liu. Contextour: Contextual contour visual analysis on dynamic multi- relational clustering. 
In Proceedings of the SIAM Conference on Data Mining (SDM10), 2010. Not published, yet. [16] Y.-R. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher. Metafac: Community discovery via relational hypergraph factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge discovery and data mining (SIGKDD 2009), pages 527–536, Paris, France, 2009. ACM. [17] A. Shashua and T. Hazan. Non-negative tensor fac- torization with applications to statistics and computer vision. In ICML ’05: Proceedings of the 22nd inter- national conference on Machine learning, pages 792– 799, New York, NY, USA, 2005. ACM. [18] J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs: dynamic tensor analysis. In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 374–383, New York, NY, USA, 2006. ACM. [19] X. Wang, J.-T. Sun, Z. Chen, and C. Zhai. Latent se- mantic analysis for multiple-type interrelated data ob- jects. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 236– 243, New York, NY, USA, 2006. ACM. 32 Listing closed sets of strongly accessible set systems with applications to data mining∗ Mario Boley1,2 Tama´s Horva´th1,2 Axel Poigne´2 Stefan Wrobel1,2 1Dept. of Computer Science III, University of Bonn, Germany 2Fraunhofer IAIS, D-53754 Sankt Augustin, Germany {mario.boley,tamas.horvath,axel.poigne,stefan.wrobel}@iais.fraunhofer.de Abstract We study the problem of listing all closed sets of a closure operator σ that is a partial function on the power set of some finite ground set E, i.e., σ : F → F with F ⊆ P(E). A very simple divide-and-conquer algorithm is analyzed that correctly solves this problem if and only if the domain of the closure operator is a strongly accessible set system. Strong accessibility is a strict relaxation of greedoids as well as of inde- pendence systems. This algorithm turns out to have delay O(|E|(TF + Tσ + |E|)) and space O(|E| + SFSσ), where TF , SF , Tσ , and Sσ are the time and space complexities of checking membership in F and computing σ, respectively. In contrast, we show that the problem becomes intractable for accessible set systems. We relate our results to the data mining problem of listing all support-closed patterns of a dataset and show that there is a corresponding closure operator for all datasets if and only if the set system satisfies a certain confluence property. submissions. ∗A long version of this extended abstract appeared in Theoret- ical Computer Science 411(3), 691–700, 2010. 33 Mining Music Playlogs for Next Song Recommendations Andre Busche, Artus Krohn-Grimberghe, Lars Schmidt-Thieme University of Hildesheim, Germany {busche, artus, schmidt-thieme}@ismll.uni-hildesheim.de Abstract Recommender systems are popular social web tools, as they address the information overload problem and provide personalization of results [1]. This paper presents a large-scale collabora- tive approach to the crucial part in the playlist recommendation process: next song recommen- dation. We show that a simple markov-chain based algorithm improves performance com- pared to baseline models when neither content, nor user metadata is available. The lack of content-based features makes this task particu- larly hard. 1 Introduction Nowadays, mobile devices enable ubiquitous access to any piece of music any time and anywhere. 
Additionally, on- line stores offer millions of song tracks to those that want fresh content. But which of this plethora shall the store suggest or the user choose? How to cope with the informa- tion overload problem? Recommender systems come as a rescue: they aim at providing the user with relevant music content, based on personal listening preferences. Given the usage pattern for mobile music devices, in the extremes two different scenarios emerge that music recom- mender systems should tackle: on the one hand, there is the interactive listening scenario, where users actively use the system and want to influence the next song being played; on the other hand, there is the passive listening scenario, where music is just an ambient factor and the users want to receive full playlists instead of single tracks. In the interactive listening scenario, e.g., music online stores aim at increasing their revenue by recommending tracks to users who did not purchase these songs before. Another use case may be a music player automatically sug- gesting candidates for the next song to play. In the passive listening scenario, portable devices are of- ten used to listen to one’s own music in a passive way while doing something else, e.g., sports. Collaborative listen- ing portals like last.fm1 or Pandora2 automatically generate music playlists given some seed song or artist, and provide the user with a sequential but uncontrollable list of tracks to listen to. One of their aims is to get users in contact to novel tracks they did not know before, possibly with the aim to sell those. ‘Automatic playlist generation’ is involved in all of the cases above as a means of generating a probable sequence 1http://www.last.fm/ 2http://www.pandora.com/ or set of songs to play (next) given some fixed but poten- tially huge set of songs in a library. While all these applica- tions are of the same kind, the available data for them used to differ considerably: Online stores and listening portals usually had no information about the music library of the user and, traditionally, portable devices were only able to choose next tracks from the users’ music library for playlist generation. The availability of nearly ubiquitous internet, though, changed this picture where services like iTunes or Zune in combination with the respective devices have ac- cess to both, the user’s music library and the vast libraries of online stores. In this paper, we exploit massive collaborative usage data gathered on portable devices to recommend songs a user might want to listen to next. In the recommendation lists, songs available in the user’s library as well as songs not yet existing in his library are considered. Thus we allow for utility maximization of the user’s library and for cross- selling opportunities for the music store the user is con- nected to. We can show that collaborative recommendation tech- niques can greatly improve performance over baseline methods in situations where content features are unavail- able. Furthermore, we demonstrate that for next song rec- ommendation denser data yields better performance. 2 Related Work Playlist generation has been studied with great interest for about the last five years by many researchers, each focus- ing on different recommendation techniques. A content- based approach to derive playlists by providing the system with one start and end song has been introduced in [3]. 
The authors strive at providing smooth transitions between genres: Having MFCC coefficients[4] from the start and end song at hand, they model the songs by a single Gaus- sian and compute an ’optimal transition playlist’ by means of a (weighted) sum of ’optimal’ divergence ratios. Fig- uratively, their approach recommends those songs which are closest to the λ-weighted ’connection line’ between the songs. The evaluation in [3] is based on whether the generated playlists only contain songs from source genre, target genre, or both. However, it has been shown in [9] that while experts and social communities commonly agree when (manually) classifying songs between distinct genres at a coarse level (e.g. ’Classic’, or ’Rock’ songs), their agreement decreases when it comes to the classification of songs to subgenres. An algorithm for learning possible song transitions from radio station playlists was introduced in [5]. While their ap- proach is close to ours on a conceptual level, it is radically 35 different w.r.t. the features used. They use content fea- tures such as the MFCC coefficients and others to model song transition probabilities based on the content, rather than collaborative play information and available metadata as in our approach. To this end, they are more concerned enabling their approach to also work with previously un- known content, for which no collaborative data yet exists. This problem, known as the ’new-item’-problem for rec- ommender systems, as well as the ’new-user’ problem [8], is not handeled here. Additionally, [5] incorporate tag information when rec- ommending playlist items. Tags have become a corner- stone of the Web 2.0 and, furthermore, they increase rec- ommender performance for item prediction (e.g. next song prediction) [6]. A manual way to get ’gold-standard’ tags is described in [10]. Regrettably, tags are not available in our dataset. While there is much related work in the area of movie recommendation (e.g. [11]), we deem this task to be funda- mentally different: One the one hand, short video clips and music tend to be consumed in a session-oriented way. On the other hand, TV programme length movies are mostly consumed one at a time.For the former scenario, it can be assumed that within a session a user’s mood remains in more or less the same state. For the latter, the time spans are considered too large for this to be necessarily the case. 3 Methodology In this section we describe our technique for next song rec- ommendation based on playlogs. For training data we re- strict our algorithms to collaborative data. We use the Zune dataset which was sponsored by Microsoft Zune3. It con- tains an anonymized sample of multimedia play log entries from Sep ’08 to Feb ’09 plus additional metadata on the multimedia items. With 1.3 billion log entries this dataset is quite large, leveraging information from 1.3 million users u ∈ U on 44 million song tracks t ∈ T . Further does it con- tain playlists generated by external experts. These playlists comprise songs grouped by, e.g., themes. However, neither content-information nor tags are available in our sample. For the remainder of this work, we only consider a sub- set of the data: First, we only use songs, ignoring the other multimedia content. Second, of the songs we restrict our- selves to those contained in the predefined playlists. Fur- thermore, we only consider users who have played at least 60 songs, possibly spread over multiple sessions S. 
This filtering reduces the data to a ‘dense’ part, composed of approx. 500k users u and 26k tracks t. 18.4 million (1.4% of the original) log entries relate these users to items. This yields our largest training and sole evaluation dataset, ‘Playlog’. Our data also contains some hand-made playlists cre- ated by experts. Since the playlist information is explicitly given, we can directly transform them into sessions for later use. From the playlists, we derive two further datasets: • ‘Playlist (sequential)’: The sequential occurrence within the playlist, as initially ordered by the expert. • ‘Playlist (combinatorial)’: The set-based interpreta- tion of the playlists. We define the density of our data to be the fraction of present transitions to the amount of possible transitions 3http://www.zune.net density(S) := | S || T |2 with T being the filtered set of tracks (| T |∼ 26k). Let a ‘transition’ from song ti to song ti+1 happen iff song ti is played for its whole duration, and song ti+1 directly follows ti, possibly with some lag (see below). Table 1 gives an overview of the densities within our de- rived datasets. Technique Density Playlog 2.2% Playlists (sequential) 0.0001% Playlists (combinatorial) 0.003% Table 1: Varying Densities for the derived datasets 3.1 Reconstructing Session Data We define a session (consecutive plays of tracks t by a user u) as su,m = {t1, . . . , tTs} ∈ Su with m many sessions for user u. If the session su,m is obvious from the context, we omit the subscripts for readability. The tracks in one session are ordered w.r.t. ti ≺ ti+1 ≡ abs(end(ti)− start(ti+1)) < g ∀i = 1, . . . , Ts − 1, some g ≥ 0 and start(·) and end(·) being the start and end times- tamp from the logfile, respectivly. No explicit session information is available in the data. To derive session data, we empirically choose some g > 0 based on the lag distribution in the log file. The lag between two songs is assumed to be (close to) 0, if two songs are enqueued in a playlist and played after the other. This, however, is not always true, as depicted in Figure 1. 0 2000 4000 6000 8000 10000 12000 14000 16000 0 5 10 15 20 25 30 o cc u rr e n ce s transition time in seconds lag distribution frequency Figure 1: Gap between consecutive songs Two effects can easily be seen in the chart: First, there are several lags close to 0 seconds. Second, there is a small increase around the lag of 20 seconds. We interpret lags close to 0 to come from the data collection mechanism, that is, log entries are rounded to the nearest second, and the de- vices need some time to (pre)load the next song to play, or cross-fades are used. Gaps around the duration of 20 sec- onds are assumed to define ‘end-of-playlist’-markers, that is, the last song within a playlist ends, the user recognises ‘silence’ after a short amount of time, locates his device, and loads/starts another playlist. Since there is no indica- tion that a consecutive playlist is thematically close to the 36 one before, we assume g = 10 seconds for session split- ting. Doing so we are following standard literature: E.g., in [5], track transitions are mined from online radio stations (possibly also containing advertisements). They use lags of 20 seconds. In [7], it is assumed that a session breaks if the elapsed time between two songs exceeds 5 minutes. Using this technique, we yield 8.8 million sessions from the 18.4 million log entries. As it will become clear in Section 4, we only use those sessions having at least two tracks. 
We also require each user to contribute at least two sessions. This way we can avoid the 'new user' problem. Figure 2 shows the occurrences of the different session lengths in our data (see the 'all pair' line there).

[Figure 2: Distribution of session lengths — log-log plot of occurrences over session length, showing the 'test pair' and 'all pair' distributions.]

3.2 Baselines

For comparison with the algorithms presented later, we use the following 'most popular' (MP) baseline methods for next song prediction:

• most often downloaded/purchased songs,
• most often played songs,
• most popular by user.

Our 'most popular by user' baseline predicts those tracks to the user that he has listened to most. The baseline methods are only trained on the playlog, do not consider the sequence information available in the logfile, and always output a constant prediction.

3.3 Markov-based prediction

We use a Markov-chain model of length 1 to incorporate sequence information in the prediction task. We define the probability of the occurrence of song t_{i+1}, given track t_i, as

p̂(t_{i+1} | t_i) = |{(t'_i, t'_{i+1}) ∈ S : t'_i = t_i, t'_{i+1} = t_{i+1}}| / |{(t'_i, t'_{i+1}) ∈ S : t'_i = t_i}|,

with S = ∪_{u∈U} S_u when the algorithm is trained on 'Playlog'. When trained on the two playlist variants, S becomes the union of all transitions derived from the playlists, in either the sequential or the combinatorial way as described in Section 3.

Taking Top-N predictions into account, we define our classifier as

t̂_{i+1}(t_i, N) := argmax^N_{t ∈ T} p̂(t_{i+1} = t | t_i),

with a slight abuse of notation, as we let argmax return a (ranked) list of the N most probable tracks. Ties in p̂(t_{i+1} | t_i) are broken arbitrarily.

4 Evaluation

4.1 Evaluation protocol

We split our data using the 'leave-one-session-out' paradigm. The main idea is to predict either the next or all following songs, given some start/seed songs in that session. In an application scenario, predicting the next song means that we are able to reconstruct the listening history of the user under changing conditions (the recommendation is adjusted each time a next song is played). By predicting multiple following songs, we aim at measuring whether we are able to capture the 'mood' or 'style' of the user's listening preferences for the current session.

In both protocols, we average results over each session s_m ∈ S (each having T_s many tracks) by choosing some 'test song' t_{j+1} with j ∈ [1, T_s − 1]. Training is done using all other sessions S \ {s_m} and the songs {t_1, ..., t_j} ∈ s_m.

Using this technique yielded 4.5 million training sequences (consecutive song plays) and 8.8 million test sequences. Please note that we have more test sequences than training sequences. This is necessarily the case, since we required j to be in [1, T_s − 1], and thus, considering a session of length 2 (our minimal required length), we always use song t_1 as a seed for the recommendation algorithm to predict song t_2. Figure 2 shows that the test set distribution ('test pair': the distribution of j + 1 plotted on the x-axis) reflects the true session length distribution of the session data extracted from the log file. The reason for having more test sequences of length 2 than actual sessions of length 2 is that each session of length greater than 2 can possibly be reduced to create a test pair for j = 1.

Predicting the next song

The first evaluation focuses on next-song prediction, i.e. only song t_{j+1} should be predicted by our algorithm.
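For concreteness, the following Python sketch estimates the length-1 Markov transition probabilities by counting and runs the next-song protocol just described, checking for each test case whether the true next track is among the N recommendations (which is what the P@N measure defined next formalizes). All names are ours; it is a simplified sketch, not the implementation used in the experiments, and in particular it keeps one fixed model instead of retraining on S \ {s_m} for each test session.

```python
from collections import defaultdict

def train_markov(sessions):
    """Estimate first-order transition probabilities p(t_next | t_cur) by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def recommend_top_n(model, t_i, n=5):
    """Ranked list of the n most probable successors of t_i (ties broken arbitrarily)."""
    return [t for t, _ in sorted(model.get(t_i, {}).items(), key=lambda kv: -kv[1])[:n]]

def next_song_hit_rate(test_sessions, model, n=5):
    """For every session and every j in [1, T_s - 1], use t_j as seed and check
    whether the true next song t_{j+1} is among the n recommendations."""
    hits = total = 0
    for session in test_sessions:
        for j in range(1, len(session)):
            seed_last, truth = session[j - 1], session[j]
            hits += int(truth in recommend_top_n(model, seed_last, n))
            total += 1
    return hits / total if total else 0.0
```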
To evaluate the performance, we calculate the precision of the first N recommended tracks,

P@N := avg_{s_m ∈ S} I( t^{s_m}_{j+1} ∈ t̂_{j+1}(t_j, N) ),

with I(·) being 1 if the argument is true and 0 otherwise. The precision captures whether one of the N recommended tracks matches the true following track.

Top-N list intersection

When evaluating against all following songs of a session, we use Top-N list intersection as evaluation measure. It is able to capture whether our recommendation algorithm correctly identifies the 'mood' of the user. It evaluates whether t̂_{j+1}(t_j) matches some t_o with o ≥ j + 1. Thus, the measure can also be thought of as indicating whether our algorithms are able to identify songs a user likes, or dislikes, given the current session/'mood'. It measures the overlap of the recommendations with all following songs within a session and is given by

list(N) := Σ_{s_m ∈ S} |t̂_{j+1}(t_j, N) ∩ {t_o ∈ s_m : o ≥ j + 1}| / min(|t̂_{j+1}(t_j, N)|, T_s − j, N),

again with N = 5. Please note that both lists in the measure may contain fewer than 5 tracks. Possible reasons are the recommendation of fewer than 5 tracks (|t̂_{j+1}(t_j, N)| < 5), or too few tracks remaining until the end of the session (T_s − j < 5) in the test data.

4.2 Results

Results for our experiments are given in Table 2.

Table 2: Results for the P@5 and list(5) measures in %

  Algorithm                          P@5    list(5)
  Markov (global)                    43.9   61.8
  Markov (playlists, sequential)     11.5   17.6
  Markov (playlists, combinatorial)  14.0   23.9
  MP (user)                           3.2    7.9
  MP (download)                       1.2    1.8
  MP (playcount)                      1.2    1.8

A closer look at the predictions for MP (download) and MP (playcount) revealed that the recommended tracks are identical. We consider our filtering process, as described in Section 3, to be too restrictive here. Furthermore, note that the 'most popular by user' results are not further ranked: A user having heard each of his tracks only once introduces ties which cannot be broken in a smart fashion.

More interesting is a comparison between our Markov-based recommendations. For next song prediction, the global variant taking collaborative information into account clearly exceeds both variants trained on the predefined playlists. This is supported by the density estimates given in Table 1, since the playlists contain much less sequential information than the log files.

The overall increase in performance of the combinatorial playlist variant over the sequential one for both evaluation measures shows that playlists within our data are used to group tracks rather than to reflect sequential transitions.

A closer look at the performance of the global Markov recommender broken down by session length gives interesting insights into the data, as shown in Figure 3.

[Figure 3: Results for different session lengths (j) for the global Markov recommender — P@5 plotted over the session length for the Markov chain of length 1.]

As expected from a collaborative model, performance increases when the available historic data in the sessions increases as well. While we have shown in Figure 2 that the total amount of test data decreases for increasing j, the overall trend is still surprising, since we have not considered Markov lengths > 1 here. We assume that the derived sessions longer than 2 are based on random plays of only a few tracks, i.e. repetitive sequences are present among different sessions.

5 Conclusion and Future Work

In this work we have presented our initial algorithm to recommend the songs a user should listen to next. We showed that simple collaborative information greatly improves performance compared to baseline methods.
As it has been shown in literature [2][5], incorporating actual content data might help in the recommendation task. However, up to our best knowledge, there is no clear and obvious way on how to do so. Furthermore, we currently limited our research to only use implicit metadata derived from listening history data, rather than using, e.g., user-defined tags to guide the rec- ommendation process. Further gathering of metadata, be it either content or tag data, will result in the need to de- fine (multiple) similarity measures, which need to be com- bined in some smart way to improve the recommendation process. Initial studies in this area have been done, e.g., in [5]. However, while their approach shows that learning and using such similarity measures can help, they also im- plicitly show that current content-based solutions have lots of space for further improvements. To this end, we aim at adding more metadata to our system. References [1] Adomavicius G., Tuzhilin, A.: Toward the Next Gener- ation of Recommender Systems: A Survey of the State- of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, Vol 17, No 6, June 2005 pp. 734-749 [2] Casey, M. A., et al.: Content-Based Music Information Retrieval: Current directions and Future Challenges, Proceedings of the IEEE, Vol 96, No. 4, April 2008, pp. 668-696. [3] Flexer, A., et al.: Playlist Generation Using Start and End Songs, ISMIR 2008, pp. 173-178. [4] Logan, B.: Mel Frequency Cepstral Coefficients for Music Modelling, ISMIR 2000. [5] Maillet, F., et al.: Steerable Playlist Generation by Learning Song Similarity from Radio Station Playlists, ISMIR 2009, pp. 345-350. [6] Nanopoulos A., Krohn-Grimberghe A.: Recommend- ing in Social Tagging Systems based on Kernelized Multiway Analysis, IFCS 2009. [7] Ragno, R., Burges, CJC, Herley, C.: Inferring similar- ity between music objects with application to playlist generation, ACM MIR 2005, pp. 73-80. [8] Schein et al.: Methods and metrics for cold-start rec- ommendations, SIGIR 2002, pp. 253-260. [9] Sordo et al.: The Quest for Musical Genres: Do the Experts and the Wisdom of the Crowds Agree?, ISMIR 2008, pp. 255-260. [10] Turnbull, D., Barrington, L., Lanckriet, G.: Five Ap- proaches to Collecting Tags for Music, ISMIR 2008, pp. 225-230. [11] Yang, B., et al.: Online video recommendation based on multimodal fusion and relevance feedback, ACM CIVR 2007, pp. 73-w80. 38 Graded Multilabel Classification: The Ordinal Case∗ Weiwei Cheng, Krzysztof Dembczynski, and Eyke Hu¨llermeier Marburg University Hans-Meerwein-Str., 35032 Marburg, Germany {cheng, dembczynski, eyke }@informatik.uni-marburg.de Abstract We propose a generalization of multilabel clas- sification that we refer to as graded multilabel classification. The key idea is that, instead of re- questing a yes-no answer to the question of class membership or, say, relevance of a class label for an instance, we allow for a graded member- ship of an instance, measured on an ordinal scale of membership degrees. This extension is moti- vated by practical applications in which a graded or partial class membership is natural. Apart from introducing the basic setting, we propose two general strategies for reducing graded multi- label problems to conventional (multilabel) clas- sification problems. Moreover, we address the question of how to extend performance metrics commonly used in multilabel classification to the graded setting, and present first experimental re- sults. 
1 Introduction Problems of multilabel classification (MLC), in which an instance may belong to several classes simultaneously or, say, in which more than one label can be attached to a sin- gle instance, are ubiquitous in everyday life: At IMDb, a movie can be categorized as action, crime, and thriller, a CNN news report can be tagged as people and political at the same time, etc. Correspondingly, MLC has received increasing attention in machine learning in recent years. In this paper, we propose a generalization of MLC that we shall refer to as graded multilabel classification (GMLC). The key idea is that, instead of requesting a yes- no answer to the question of class membership or, say, rele- vance of a class label for an instance, we allow an instance to belong to a class to a certain degree. In other words, we allow for graded class membership in the sense of fuzzy set theory [Zadeh, 1965]. In fact, there are many applica- tions for which this extension seems to make perfect sense. In the case of movie genres, for example, it is not always easy to say whether or not a movie belongs to the cate- gory action, and there are definitely examples which can be considered as “almost action” or “somewhat action”. An- other obvious example comes from one of the benchmark data sets in MLC, namely the emotions data [Trohidis et al., 2008]. Here, the problem is to label a song according to the Tellegen-Watson-Clark model of mood: amazed-surprised, ∗This paper has been presented at International Conference on Machine Learning, Haifa, Israel, 2010. happy-pleased, relaxing-clam, quiet-still, sad-lonely, and angry-aggressive. It is important to emphasize that the relevance of a label is indeed gradual in the sense of fuzzy logic and not un- certain in the sense of probability theory. The latter would mean that, e.g., a song is either relaxing or it is not—one is only uncertain about which of these two exclusive al- ternatives is correct. As opposed to this, gradualness is caused by the vagueness of categories like “relaxing song” and “action movie”, and means that one does not have to fully agree on one of the alternatives. Instead, one can say that a song is somewhere in-between (and can be certain about this). As will be explained in more detail later on, our idea is to replace simple “yes” or “no” labels by membership degrees taken from a finite ordered scale such as M = { not at all, somewhat, almost, fully }. (1) Admittedly, graded multilabel data sets of that kind are not yet widely available. We believe, however, that this is a kind of hen and egg problem: As long as there are no meth- ods for learning from graded multilabel data, new data sets will be created in the common way, possibly forcing people to give a “yes” or “no” answer even when they are hesitat- ing. The rest of this paper is organized as follows: The prob- lem of multilabel classification is introduced in a more for- mal way in Section 2. In Section 3, we propose our graded generalization of MLC and, moreover, outline two differ- ent strategies for reducing GMLC problems to conventional (multilabel) classification problems. In Section 4, we ad- dress the question of how to extend MLC evaluation met- rics from the conventional to the graded setting. Finally, Section 5 presents some first experimental results. 2 Multilabel Classification Let X denote an instance space and let L = {λ1, λ2, . . . , λm} be a finite set of class labels. 
More- over, suppose that each instance x ∈ X can be associated with a subset of labels L ∈ 2L; this subset is often called the set of relevant labels, while the complement L \ L is considered as irrelevant for x. Given training data in the form of a finite set T of observations in the form of tu- ples (x, Lx) ∈ X × 2L, typically assumed to be drawn independently from an (unknown) probability distribution on X × 2L, the goal in multilabel classification is to learn a classifier H : X −→ 2L that generalizes well beyond these observations in the sense of minimizing the expected prediction loss with respect to a specific loss function; ex- amples of commonly used loss functions include the subset 39 zero-one loss, which is 0 if H(x) = Lx and 1 otherwise, and the Hamming loss that computes the percentage of la- bels whose relevance is incorrectly predicted: EH(H(x), Lx) = 1 |L| ∣∣H(x) ∆Lx∣∣, (2) where ∆ is the symmetric difference between sets. An MLC problem can be reduced to a conventional clas- sification problem in a straightforward way, namely by con- sidering each label subsetL ∈ 2L as a distinct (meta-)class. This approach is referred to as label powerset (LP) in the literature. An obvious drawback of this approach is the po- tentially large number of classes that one has to deal with in the newly generated problem; obviously, this number is 2|L| (or 2|L| − 1 if the empty set is excluded as a predic- tion). This is the reason why LP typically works well if the original label set L is small but quickly deteriorates for larger label sets [Tsoumakas and Vlahavas, 2007]. Another way of reducing multilabel to conventional clas- sification is offered by the binary relevance (BR) approach. Here, a separate binary classifierHi is trained for each label λi ∈ L, reducing the supervision to information about the presence or absence of this label while ignoring the other ones. For a query instance x, this classifier is supposed to predict whether λi is relevant for x (Hi(x) = 1) or not (Hi(x) = 0). A multilabel prediction for x is then given by H(x) = {λi ∈ L |Hi(x) = 1}. Since binary relevance learning treats every label independently of all other labels, an obvious disadvantage of this approach is its ignorance of correlations and interdependencies between labels. Many approaches to MLC learn a multilabel classifierH in an indirect way via a scoring function f : X× L −→ R that assigns a real number to each instance/label combina- tion. The idea is that a score f(x, λ) is in direct correspon- dence with the probability that λ is relevant for x. Given a scoring function of this type, multilabel prediction can be realized via thresholding: H(x) = {λ ∈ L | f(x, λ) ≥ t } , where t ∈ R is a threshold. As a byproduct, a scoring func- tion offers the possibility to produce a ranking (weak order) x of the class labels, simply by sorting them according to their score: λi x λj ⇔ f(x, λi) ≥ f(x, λj) . (3) Sometimes, this ranking is even more desirable as a predic- tion, and indeed, there are several evaluation metrics that compare a true label subset with a predicted ranking in- stead of a predicted label subset; an example is the rank loss which computes the average fraction of label pairs that are not correctly ordered: ER(f, Lx) = ∑ (λ,λ′)∈Lx×Lx S(f(x, λ), f(x, λ ′)) |Lx| × |Lx| , where Lx = L \ Lx is the set of irrelevant labels and S(u, v) = 1 if u < v, = 1/2 if u = v, and = 0 if u > v. 
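To make these loss definitions concrete, here is a small, self-contained Python sketch of the Hamming loss (2) and the rank loss defined above, operating on a predicted label set, a true label set, and a scoring function. It is purely illustrative and not tied to any particular MLC library; the convention of returning 0 when the relevant or irrelevant set is empty is our assumption.

```python
def hamming_loss(predicted: set, relevant: set, labels: list) -> float:
    """Fraction of labels whose relevance is predicted incorrectly (Eq. 2)."""
    wrong = sum(1 for lam in labels if (lam in predicted) != (lam in relevant))
    return wrong / len(labels)

def rank_loss(scores: dict, relevant: set, labels: list) -> float:
    """Average fraction of (relevant, irrelevant) label pairs that are not correctly ordered."""
    irrelevant = [lam for lam in labels if lam not in relevant]
    if not relevant or not irrelevant:
        return 0.0  # assumption: loss defined as 0 if either set is empty

    def penalty(u, v):  # S(u, v) from the text: 1 if u < v, 1/2 if u == v, 0 if u > v
        return 1.0 if u < v else 0.5 if u == v else 0.0

    total = sum(penalty(scores[r], scores[i]) for r in relevant for i in irrelevant)
    return total / (len(relevant) * len(irrelevant))
```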
The idea to solve both problems simultaneously, ranking and MLC, has recently been addressed in [Fu¨rnkranz et al., 2008]: A calibrated ranking is a ranking with a “zero point” separating a positive (relevant) part from a negative (irrelevant) one. 3 Graded Multilabel Classification Generalizing the above setting of multilabel classification, we now assume that each instance x ∈ X can belong to Figure 1: Vertical reduction, viz. prediction of membership degree (ordinate) for each label (abscissa). each class λ ∈ L to a certain degree. In other words, the set Lx of relevant labels is now a fuzzy subset of L. This fuzzy set is characterized by a membership function, namely an L −→ M mapping, where M is the set of graded mem- bership degrees. For notational simplicity, we shall not distinguish between the fuzzy set Lx and its membership function, and denote by Lx(λ) the degree of membership of the label λ ∈ L in the fuzzy set Lx. In fuzzy set theory, the set of membership degrees is sup- posed to form a complete lattice and is normally taken as the unit interval (i.e., M = [0, 1] endowed with the stan- dard order). Here, however, we prefer an ordinal scale of membership degrees, that is, a finite ordered set of member- ship degrees such as (1). More generally, we assume that M = {m0,m1, . . . ,mk}, where m0 < m1 < . . . < mk (and m0 = 0 and mk = 1 have the special meaning of zero and full membership). In the context of multilabel clas- sification, an ordinal membership scale is arguably more convenient from a practical point of view, especially with regard to data acquisition. In fact, people often prefer to give ratings on an ordinal scale like (1) instead of choosing precise numbers on a cardinal scale. The goal, now, is to learn a mapping H : X −→ F(L), where F(L) is the class of fuzzy subsets of L (with mem- bership degrees in M ). Following the general idea of re- duction [Balcan et al., 2008], we seek to make GMLC problems amenable to conventional multilabel methods via suitable transformations. There are two more or less obvi- ous possibilities to reduce graded multilabel classification to conventional (multilabel) classification. In agreement with the distinction between the “vertical” description of a fuzzy subset F of a set U (through the membership func- tion, i.e., by specifying the degree of membership F (u) for each element u ∈ U ) and the “horizontal” description (via level cuts [F ]α = {u ∈ U |F (u) ≥ α}), we distinguish between a vertical and a horizontal reduction. 3.1 Vertical Reduction Recall the binary relevance approach to conventional MLC: For each label λi ∈ L, a separate binary classi- fier Hi is trained to predict whether this label is relevant (Hi(x) = 1) or not (Hi(x) = 0) for a query instance x ∈ X. Generalizing this approach to GMLC, the idea is to induce a classifier Hi : X −→M (4) for each label λi. For each query instance x ∈ X, this clas- sifier is supposed to predict the degree of membership of λi in the fuzzy set of labels Lx. Instead of a binary clas- sification problem, as in MLC, each classifier Hi is now 40 Figure 2: Horizontal reduction, viz. prediction of a subset of labels (indicated by black circles) on each level. solving a multi-class problem. Since the target space M has an ordinal structure, these problems are ordinal clas- sification problems. In other words, the vertical reduction of a GMLC problem eventually leads to solving a set of m (non-independent) ordinal classification problems; see Fig. 1 for an illustration. 
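A minimal sketch of this vertical reduction: one ordinal classifier per label, with the ordinal scale M = {m_0 < ... < m_k} encoded as the integers 0..k. The base-learner interface (fit/predict) and all names are our assumptions; this is an illustration of the reduction, not the authors' implementation.

```python
class VerticalReduction:
    """Vertical reduction of GMLC: one ordinal classifier H_i per label (cf. Eq. 4)."""

    def __init__(self, base_learner, labels):
        # base_learner() must return an estimator with fit(X, y) / predict(X),
        # where y takes values in the ordered scale encoded as integers 0..k.
        self.labels = labels
        self.models = {lam: base_learner() for lam in labels}

    def fit(self, X, Y):
        # Y[lam] is the vector of membership degrees of label lam (integers 0..k)
        for lam in self.labels:
            self.models[lam].fit(X, Y[lam])
        return self

    def predict(self, X):
        # returns, per instance, a fuzzy label set: label -> predicted degree
        per_label = {lam: self.models[lam].predict(X) for lam in self.labels}
        n = len(next(iter(per_label.values())))
        return [{lam: per_label[lam][i] for lam in self.labels} for i in range(n)]
```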
Just like simple binary problems, ordinal classification problems are often solved indirectly via “scoring plus thresholding”: First, a scoring function f(·) is learned, and k thresholds t1, . . . , tk are determined; then, for an instance x, the i-th class is predicted if f(x, λi) is between ti−1 and ti. Of course, if classifiers (4) are learned in this way, i.e., by inducing a scoring function f(·, λi) for each label λi, then these scoring functions can also be used to predict a ranking (3). 3.2 Horizontal Reduction From fuzzy set theory, it is well-known that a fuzzy set F can be represented “horizontally” in terms of its level-cuts. This representation suggests another decomposition of a GMLC problem: For each level α ∈ {m1,m2, . . . ,mk}, learn the mapping H(α) : X −→ 2M , x 7→ [Lx]α . (5) Obviously, each of these problems is a standard MLC problem, since the level-cuts [Lx]α are standard subsets of the label set L. Thus, the horizontal reduction comes down to solving k standard MLC problems; see Fig. 2 for an il- lustration. It is worth mentioning that this decomposition comes with a special challenge. In fact, since level-cuts are nested in the sense that [F ]α ⊂ [F ]β for β < α, the k MLC prob- lems are not independent of each other. Instead, the predic- tions should be monotone in the sense that (H(mj)(x) = 1)⇒ (H(mj−1)(x) = 1) (6) for all j ∈ {2, . . . , k}. Thus, whenever a label λi is pre- dicted to be in the mj-cut of the fuzzy label set Lx associ- ated with x, it must also be in all lower level-cuts. Satis- fying this requirement is a non-trivial problem. In particu- lar, (6) will normally not be guaranteed when solving the k problems independently of each other. Once an ensemble of k multilabel classifiers H(m1), . . . ,H(mk) has been trained, predictions can be obtained as follows: H(x)(λ) = max{mi ∈M |λ ∈ H(mi)(x) } (7) Thus, the degree of membership of a label λ ∈ L in the predicted fuzzy set of labels associated with x is given by the maximum degree mi ∈ M for which λ is still in the predicted mi-cut of this set. The prediction of a ranking (3) is arguably less obvi- ous in the case of the horizontal decomposition. Suppose that f (m1), . . . , f (mk) are scoring functions trained on the k level cuts, using a conventional MLC method. As a coun- terpart to the monotonicity condition (6), we should require f (m1)(x, λ) ≥ f (m2)(x, λ) ≥ . . . ≥ f (mk)(x, λ) (8) for all x ∈ X and λ ∈ L. In fact, interpreting f (mi)(x, λ) as a measure of how likely λ is a relevant label on levelmi, this condition follows naturally from [Lx]m1 ⊃ [Lx]m2 ⊃ . . . ⊃ [Lx]mk . On each level mi, the function f (mi)(x, ·) induces a ranking (mi)x via (3), however, the identity (mi)x ≡(mj)x is of course not guaranteed; that is, (mi)x may differ from (mj)x for 1 ≤ i 6= j ≤ k. To obtain a global ranking, the level-wise rankings (mi)x need to be aggregated into a single one. To this end, we propose to score a label λ by f(x, λ) = k∑ i=1 f (mi)(x, λ) . (9) This aggregation is especially reasonable if the scores f (mi)(x, λ) can be interpreted as probabilities of relevance P(λ ∈ [Lx]mi). Then, f(x, λ) simply corresponds to the expected level of x, since k∑ i=1 f (mi)(x, λ) = k∑ i=1 P(λ ∈ [Lx]mi) = = k∑ i=1 P(Lx(λ) ≥ mi) = k∑ i=1 i ·P(Lx(λ) = mi) Note, however, that we simply equated the levels mi with the numbers i in this derivation, i.e., the ordinal scaleLwas implicitly embedded in a numerical scale by the mapping mi 7→ i (on L itself, an averaging operation of this kind is not even defined). 
Despite being critical from a theoret- ical point of view, this embedding is often used in ordinal classification, for example when computing the absolute er- ror AE(mi,mj) = |i − j| as a loss function [Lin and Li, 2007]. Interestingly, the absolute error is minimized (in ex- pectation) by the median and, moreover, this estimation is invariant toward rescaling [Berger, 1985]. Thus, it does ac- tually not depend on the concrete embedding chosen. Seen from this point of view, the median appears to be a theoreti- cally more solid score than the mean value (9). However, it produces many ties, which is disadvantageous from a rank- ing point of view. This problem is avoided by (9), which can be seen an approximation of the median that breaks ties in a reasonable way. 3.3 Combination of Both Reductions As mentioned above, the binary relevance approach is a standard (meta-)technique for solving MLC problems. Consequently, it can also be applied to each problem (5) produced by the horizontal reduction. Since BR can again be seen as a “vertical” decomposition of a regular MLC problem, one thus obtains a combination of horizontal and vertical decomposition: first horizontal, then vertical. Likewise, the two types of reduction can be combined the other way around, first vertical and then horizontal. This is done by solving the ordinal classification problems 41 produced by the vertical reduction by means of a “horizon- tal” decomposition, namely a meta-technique that has been proposed by [Frank and Hall, 2001]: Given an ordered set of class labels M = {m0,m1, . . . ,mk}, the idea is to train k binary classifiers. The i-th classifier considers the in- stances with label m0, . . . ,mi−1 as positive and those with label mi, . . . ,mk as negative. Interestingly, both combinations eventually coincide in the sense of ending up with the same binary classification problems. Roughly speaking, a single binary problem is solved for each label/level combination (λi,mj) ∈ L×M (each circle in the picture in Fig. 2), namely the problem to decide whether Lx(λi) ≤ mj or Lx(λi) > mj . Any dif- ference between the two approaches is then due to different ways of aggregating the predictions of the binary classi- fiers. In principle, however, such differences can only oc- cur in the case of inconsistencies, i.e., if the monotonicity condition (6) is violated. 3.4 Generalizing IBLR-ML Our discussion so far has been restricted to meta- techniques for reducing GMLC to MLC problems, without looking at concrete methods. Nevertheless, there are sev- eral methods that can be generalized immediately from the binary to the gradual case. As an example, we mention the IBLR-ML method that will also be used in our experiments later on. This method, which was recently proposed in [Cheng and Hu¨llermeier, 2009], combines instance-based learning with logistic regression and again trains one clas- sifier Hi for each label. For the i-th label λi, this classifier is derived from the logistic regression equation log ( pi (i) 0 1− pi(i)0 ) = ω (i) 0 + m∑ j=1 γ (i) j · ω(i)+j(x0) , (10) where pi(i)0 denotes the (posterior) probability that λi is rel- evant for x0, and ω (i) +j(x0) = ∑ x∈N (x0) κ(x0,x) · yj(x) (11) is a summary of the presence of the j-th label λj in the neighborhood of x0; here, κ is a kernel function, such as the (data-dependent) “KNN kernel” κ(x0,xi) = 1 if xi ∈ Nk(x0) and = 0 otherwise, where Nk(x0) is the set of k nearest neighbors of x0. 
Moreover, yj(x) = +1 if λj is present (relevant) for the neighbor x, and yj(x) = −1 in case it is absent (non-relevant). Obviously, this ap- proach is able to capture interdependencies between class labels: The estimated coefficient γ(i)j indicates to what ex- tent the relevance of label λi is influenced by the rele- vance of λj . A value γ (i) j > 0 means that the presence of λj makes the relevance of λi more likely, i.e., there is a positive correlation. Correspondingly, a negative co- efficient would indicate a negative correlation. Given a query instance x0, a multilabel prediction is made on the basis of the predicted posterior probabilities of relevance: H(x0) = {λi ∈ L |pi(i)0 > 1/2 }. This approach can be generalized to the GMLC setting using both the horizontal and the vertical reduction. The vertical reduction leads to solving an ordinal instead of a binary logistic regression problem for each label, while the horizontal reduction comes down to solving the following k multilabel problems (r = 1, . . . , k): log ( pi (i,r) 0 1− pi(i,r)0 ) = ω (i,r) 0 + m∑ j=1 γ (i,r) j ω (i,r) +j (x0) (12) Recall, however, that these problems are not independent of each other. Solving them simultaneously so as to guar- antee the monotonicity constraint (6) is an interesting but non-trivial task. In the experiments in Section 5, we there- fore derived independent predictions and simply combined them by (7). 4 Loss Functions As mentioned before, a number of different loss functions have already been proposed within the setting of MLC. In principle, all these functions can be generalized so as to make them applicable to the setting of GMLC. In this section, we propose extensions of some important and fre- quently used measures. Moreover, we address the question of how to handle these extensions in the context of the hor- izontal and vertical reduction technique, respectively. 4.1 Representation of Generalized Losses To generalize the Hamming loss (2), it is necessary to re- place the symmetric difference operator defined on sets, ∆, by the symmetric difference between two fuzzy sets. This can be done, for example, by averaging over the symmetric differences of the corresponding level-cuts, which in our case leads to E∗H(H(x), Lx) = ∑k i=1 ∣∣[H(x)]mi ∆ [Lx]mi∣∣ k|L| . (13) Note that this “horizontal” computation can be replaced by an equivalent “vertical” one, namely E∗H(H(x), Lx) = ∑|L| i=1 AE(H(x)(λi), Lx(λi)) k|L| , (14) where AE(·) is the absolute error of a predicted mem- bership degree which, as mentioned above, is defined by AE(mi,mj) = |i − j|. In other words, minimizing the symmetric difference level-wise is equivalent to minimiz- ing the absolute error label-wise. It is worth to mention that the existence of an equiva- lent horizontal and vertical representation of a loss func- tion, like in the case of (13) and (14), is not self-evident. For example, replacing in (14) the absolute error on the or- dinal scale M by the simple 0/1 loss leads to E∗0/1(H(x), Lx) = 1 |L| |L|∑ i=1 { 0 H(x)(λi) = Lx(λi) 1 H(x)(λi) 6= Lx(λi) . Just like (14), this is a typical vertical expression of a loss function, that is, an expression of the form A ( {` (H(x)(λi), Lx(λi))}|L|i=1 ) , where `(·) is a loss defined on L and A is an aggregation operator. Interestingly, E∗0/1 does not have an equivalent horizontal representation. Thus, there is provably no loss function L(·) on 2L (and aggregation A) such that E∗0/1(H(x), Lx) = A ( {L ([H(x)]mi , [Lx]mi)}ki=1 ) . This observation has an important implication. 
Namely, if the loss function to be minimized has a vertical but not a 42 horizontal representation, then a vertical decomposition of the learning problem is arguably more self-evident than a horizontal one, and vice versa. Strictly speaking, the non- existence of an equivalent representation does of course not exclude the existence of another loss function and aggrega- tion operator producing the same predictions. Such alter- natives, however, will normally be less obvious. As an example of a loss function that lends itself to a hor- izontal representation, consider a variant of the Hamming loss based on the well-known Jaccard-index: EJ(H(x), Lx) = |H(x) ∩ Lx| |H(x) ∪ Lx| (15) This variant avoids a certain disadvantage of the Hamming loss, which treats relevant and non-relevant labels in a sym- metric way even though the former are typically less nu- merous than the latter, thereby producing a bias toward the prediction of non-relevance. A natural generalization of this measure is obtained by averaging (15) over the levels: E∗J(H(x), Lx) = 1 k k∑ i=1 ∣∣[H(x)]mi ∩ [Lx]mi∣∣∣∣[H(x)]mi ∪ [Lx]mi∣∣ (16) This extension, however, does not admit an equivalent ver- tical representation, which is plausible since the Jaccard- index is indeed a genuine set measure. 4.2 Rank Loss The rank loss ER can be generalized in a canonical way by the so-called C-index, which is commonly used as a mea- sure of concordance in statistics [Gnen and Heller, 2005], and which is essentially equivalent to the pairwise ranking error introduced in [Herbrich et al., 2000]: E∗R(f, Lx) = ∑ i for nouns extracts the seed airport from the sentence Getting to and from coming weeks may be the problem, however. The extracted instances are added to the validation set of the respective OCE class. For each extracted verb form, we also add other verb forms matched in the text, which are automatically generated based on the bound CCEs for verbs (i.e. -(e)d, -s, and -ing). For extracted adjectives, we also add the adverb built using -ly if found in the text. After all rules have been applied, the input text is retagged based on the expanded validation sets. Tokens appearing in more than one validation list are assigned all possible tags, i.e. left ambiguous. Step 2: Rule Extraction New rule candidates are extracted using the validation sets generated in step 1. For each OCE class O, the method below is applied: 48 1. For each entry E in the validation set of O, match all in the text, where POS L (POS R) correspond to the POS tag of the left (right) context of E, i.e. an already tagged token directly pre- ceding (following) E. 2. Add to the set of rule candidates for O. Note that in the first iteration POS L and POS R represent some CCE tag. In later iterations, with some OCE tokens tagged based on the validation lists, POS L and POS R can also refer to an OCE tag. For illustration of the rule extraxtion procedure, consider the following example. For the entry airport from the validation set of nouns, we can extract the rule candidate from the sentence While closed , flights were diverted. This rule is then put to the set of rule candidates for nouns. Step 3: Rule Validation The rule candidates extracted in step 2 are validated by cal- culating the accuracy of each rule candidate r for OCE O using formula 1, acc(r) = posr + 1 posr + negr + 1 (1) where posr refers to the number of occurrences match- ing the pattern , negr refers to the number of occurrences matching the pattern , and 1 is a smoothing constant. 
XO refers to any token tagged with category O,X¬O refers to any token tagged with any open class other than O. If the calculated accuracy of a rule candidate exceeds a fixed threshold (currently set to 0.5), the rule candidate is added to the set of validated rules. The input text is retagged based on the validated rules and again, ambiguous tokens are tagged with all possible tags. 3.5 Postprocessing The postprocessing step serves two purposes: First, dis- ambiguate all OCE tokens tagged with more than one tag. Second, tag all those tokens not appearing in any of the val- idation sets and not covered by any of the learned rules. To disambiguate OCE tokens to which more than one tag has been assigned, we compare the scores of all rules matching the context of the token and apply the highest-scoring one. Several matching rules are possible if the context of the to- ken contains ambiguous tokens. For example, in order to tag the unknown token August in the sentence Trade figures fail to show a sub- stantial improvement from July near- record deficits, we need to decide whether to apply the rule (a verb rule) or (a noun rule). As score() = 0.94 and score() = 0.80, we decide to apply the higher scoring noun rule and tag August as N. To tag tokens that do not match any of the rules, we apply a backup procedure: We tag it based on its left context only. Here, we collect all rules with a matching left context, compute the average score of all rules for each of the tags in question, and assign the tag with the highest average score. The postprocessing step is iterated until all tokens have been assigned a single tag. 3.6 Noun group extraction After all tokens have been tagged, noun groups are col- lected by extracting all sequences of consecutive ADJ and N tokens. Note that all previous steps consider single to- kens, not token sequences, i.e. a learned rule cannot be ap- plied to extract a multi-word noun group directly. Instead, multi-word noun groups are extracted by learning and ap- plying rules that involve OCE tags, e.g. the learned rule for tagging nouns, which tags U.K. as N in the sentence But consumer expenditure data re- leased Friday don’t suggest that is slowing that quickly, given that economy has already been tagged as N. 4 Evaluation The algorithm was evaluated on sections 15 to 18 of the Wall Street Journal corpus, a commonly used corpus for part-of speech tagging and chunking tasks, e.g. the CONLL 2000 shared task on chunking.3 It contains 8,936 sentences with 46,874 noun groups (matching the regular expression JJ∗(NNP |NN |NNS)+). It is difficult to compare our system to others, which make use of more resources. The F-measure values of published results for the same dataset lie in the lower 90s, with a baseline F-measure of about 80 (cf. http://ifarm.nl/erikt/research/np-chunking.html). However, all these systems use a POS-annotated corpus as input, i.e., unlike our system, they require POS information to be available. Due to the difficulty of comparison to other results, we de- cided to evaluate the system by taking a look at the learning process, and evaluate the effect of the initial seed rules as well as the bootstrapping and postprocessing procedures. As baseline, we tagged all non-CCE tokens with the most probable tag, N, thus extracting all chunks occurring be- tween two CCEs as noun groups. We also evaluated the chunking result achieved using the initial validation sets extracted based on the four seed rules. 
Here, all tokens occurring in one of the initial validation sets were tagged accordingly, all other OCE tokens were tagged as N. In ad- dition, we evaluated the final tagging, which was generated by applying all rules validated by the bootstrapping process as described in 3.5. The bootstrapping stopped after 7 iter- ations. All results are presented in Tables 2 and 3. In the token-based evaluation, we look at the noun group to- kens individually and evaluate how many of them are cor- rectly considered part of a noun group by the algorithm. This evaluation procedure is similar to the one used for the CONLL 2000 shared task, which is also token-based. How- ever, we do not evaluate based on the BIO tagging scheme, but count matching noun group tokens, irrespective of their position within the chunk. The chunk-based evaluation is more strict in that it consid- ers complete chunks only, i.e. if two of three tokens in a noun group have correctly been assigned an N tag, they are not counted as a match because one token is missing, i.e. the complete chunk was not recognized. 3http://www.cnts.ua.ac.be/conll2000/chunking/ 49 Precision Recall F1 Baseline 0.65 0.97 0.78 Initial tagging 0.68 0.96 0.79 Final Tagging 0.74 0.94 0.83 Table 2: Token-based evaluation of the bootstrapping Precision Recall F1 Baseline 0.50 0.66 0.57 Initial tagging 0.54 0.68 0.60 Final Tagging 0.60 0.72 0.66 Table 3: Chunk-based evaluation of the bootstrapping 5 Discussion and future work We presented a first version of a self-adaptive noun chun- ker, which uses lists of closed-class elements, one seed ex- traction rule for each open class and a bootstrapping proce- dure to automatically generate and extend OCE validation sets and expand the set of extraction rules. An evaluation of the system’s learning progress showed the usefulness of the bootstrapping procedure. The results are preliminary as we are presenting ongoing work, improvements are ex- pected by optimizing the validation procedure. Currently, we only validate the rule candidates. Validating the ex- tracted OCE tokens before adding them to the validation lists would make the algorithm more robust and prevent ex- traction errors from being propagated. In addition, the ap- plication of more sophisticated rule validation techniques, e.g. EM-based confidence estimation as described and used by [Jones, 2005] and [Tomita et al., 2006], could improve the results. In the current system, bound CCEs only play a minor role: When building additional verb forms for the extracted verbs and when building adverbs from adjectives. In the future, we also want to use bound CCEs to generate rules dealing with the morphology of the OCE tokens (i.e. add a second type of extraction rule, similar to the spelling fea- tures used by [Collins and Singer, 1999]). We are currently evaluating the system on other, more spe- cialized corpora in order to show its domain-independency. We also plan to evaluate it on texts in other languages. In addition, the influence of the size of the input text needs to be evaluated. Acknowledgments The work described in this paper has been carried out in the context of the research project DiLiA, co-funded by the European Regional Development Fund in the context of Investitionsbank Berlins ProFIT program (grant num- ber: 10140159), and supported by a research grant from the German Federal Ministry of Economics and Technology to the project Theseus Ordo TechWatch (FKZ: 01MQ07016). We gratefully acknowledge this support. References [Collins and Singer, 1999] M. 
Collins and Y. Singer. Un- supervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empir- ical Methods in Natural Language Processing and Very Large Corpora, pages 100–111, Maryland, USA, 1999. [Eichler and Neumann, 2010] K. Eichler and G. Neu- mann. DFKI KeyWE: Ranking keyphrases extracted from scientific articles. In Proc. of the 5th International Workshop on Semantic Evaluations, ACL, 2010. [Evans and Pourcel, 2009] V. Evans and S. Pourcel, edi- tors. New Directions in Cognitive Linguistics. Human Cognitive Processing. John Benjamins, 2009. [Goldwater and Griffiths, 2007] S. Goldwater and T. Grif- fiths. A Fully Bayesian Approach to Unsupervised Part- of-Speech Tagging. In Proc. of the ACL, 2007. [Haghighi and Klein, 2006] A. Haghighi and D. Klein. Prototype-Driven Learning for Sequence Models. In Proc. of HLT/NAACL, 2006. [Ho¨hle and Weissenborn, 1999] B. Ho¨hle and J. Weis- senborn. Discovering grammar. In A.D.Friederici and R. Menzel, editors, Learning: Rule Extraction and Rep- resentation. Walter de Gruyter, Berlin, 1999. [Jones, 2005] Rosie Jones. Learning to Extract Enti- ties from Labeled and Unlabeled Texts. PhD thesis, School of Computer Science, Carnegie Mellon Univer- sity, 2005. [Merialdo, 1994] B. Merialdo. Tagging english text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994. [Riloff and Jones, 1999] E. Riloff and R. Jones. Learn- ing dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Con- ference on AI (AAAI-99), Orlando, FL, 1999. [Spurk, 2006] C. Spurk. Ein minimal berwachtes Ver- fahren zur Erkennung generischer Eigennamen in freien Texten. Diplomarbeit, Saarland University, Germany, 2006. [Talmy, 2000] L. Talmy. Towards a cognitive semantics. MIT Press, Cambridge, MA, 2000. [Tomita et al., 2006] J. Tomita, S. Soderland, and O. Et- zioni. Expanding the recall of relation extraction by bootstrapping. In EACL Workshop on Adaptive Text Ex- traction and Mining, 2006. [Yangarber et al., 2002] R. Yangarber, L. Winston, and R. Grishman. Unsupervised learning of generalized names. In Proceedings of the 19th International Con- ference on Computational Linguistics (COLING 2002), 2002. 50 Towards Adjusting Mobile Devices to User’s Behaviour Fricke, P. ∗, Jungermann, F.∗, Morik, K.∗, Piatkowski, N.∗, Spinczyk, O.†, and Stolpe, M.∗ Technical University of Dortmund Abstract Mobile devices are a special class of resource- constrained embedded devices. Computing power, memory, the available energy, and net- work bandwidth are often severely limited. These constrained resources require extensive optimization of a mobile system compared to larger systems. Any needless operation has to be avoided. Time-consuming operations have to be started early on. For instance, loading files ideally starts before the user wants to access the file. So-called prefetching strategies optimize system’s operation. Our goal is to adjust such strategies on the basis of logged system data. Op- timization is then achieved by predicting an ap- plication’s behavior based on facts learned from earlier runs on the same system. In this paper, we analyze system-calls on operating system level. The learned model predicts if a system-call is go- ing to open a file fully, partially, or just for chang- ing its rights. 1 Introduction Users demand mobile devices to have long battery life, short application startup time, and low latencies. 
Mobile devices are constrained in computing power, memory, en- ergy, and network connectivity. This conflict between user expectations and resource constraints can be reduced, if we tailor a mobile device such that it uses its capacities care- fully for exactly the user’s needs, i.e., the services, that the user wants to use. Predicting the user’s behavior given pre- vious behavior is a machine learning task. For example, based on the learning of most often used file path compo- nents, a system may avoid unnecessary probing of files and could intelligently prefetch files. Prefetching those files, which soon will be read by the user, leads to decreased startup latencies for applications and, accordingly, conser- vation of energy. The resource restrictions of mobile devices motivate the application of machine learning for predicting user behav- ior. At the same time, machine learning dissipates re- sources. There are three critical resource constraints: ∗Artificial Intelligence Group Baroper Strasse 301, Dortmund, Germany {fricke,jungermann,morik,piatkowski,stolpe}@ls8.cs.tu- dortmund.de †Embedded System Software Group Otto-Hahn-Strasse 16, Dortmund, Germany olaf.spinczyk@tu-dortmund.de • Data gathering: logging user actions uses processing capacity. • Data storage: the training and test data as well as the learned model use memory. • Communication: if training and testing is performed on a central server, sending data and the resulting model uses the communication network. • Response time: the prediction of usage, i.e., the model application, has to happen in short real-time. The dilemma of saving resources at the device through learning which, in turn, uses up resources, can be solved in several ways. Here, we set aside the problem of data gathering and its prerequisites on behalf of operation sys- tems for embedded systems [Lohmann et al., 2009] [Tartler et al., ] [Cantrill et al., 2004].This is an important issue in its own right. Regarding the other restrictions, especially the restriction of memory, leads us to two alternatives. Server-based learning: The learning of usage profiles from data is performed on a server and only the result- ing model is communicated back to the device. Learn- ing is less restricted in runtime and memory consump- tion. Just the learned model must obey the runtime and communication restrictions. Hence, a complex learning method is applicable. Figure 1 shows this alternative. Device-based learning: The learning of usage profiles on the device is severely restricted in complexity. It does not need any communication but requires training data to be stored. Data streaming algorithms come into play in two alternative ways. First, descriptive algo- rithms incrementally build-up a compact way to store data. They do not classify or predict anything. Hence, in addition, simple methods are needed that learn from the aggregated compact data. Second, simple online algorithms predict usage behavior in realtime. The lat- ter option might only be possible if specialized hard- ware is used, e.g., General Purpose GPUs. Figure 2 shows this alternative. In this paper, we want to investigate the two alternatives using logged system calls. Server-based learning is exem- plified by predicting file-access types in order to enhance prefetching. It is an open question whether structural mod- els are demanded for the prediction of user behavior on the basis of system calls, or simpler models such as Naive Bayes suffice. 
Should the sequential nature of system calls be taken into account by the algorithm? Or is it sufficient to encode the sequences into the features? Or should features as well as algorithm be capable of explicitly addressing se- 51 Figure 1: Server-based Architecture Figure 2: Device-based Architecture quences? We investigate the use of two extremes, Con- ditional Random Fields (CRF) and Naive Bayes (NB). In particular, we inspect their memory consumption and run- time, both, for training and applying the learned function. Section 2 presents the study of server-based learning for ubiquitous devices. We derive the learning task from the need of enhancing prefetching strategies, describe the log data used, and present the learning results together with re- source consumptions of NB and CRF. Device-based learning is exemplified by recognizing ap- plications from system calls in order to prevent fraud. We apply the data streaming algorithm Hierarchical Heavy Hit- ters (HHH) yielding a compact data structure for storage. Using these, the simple kNN method classifies systems calls. In particular, we investigate how much HHH com- press data. Section 3 presents the study of device-based learning using a streaming algorithm for storing compact data. We conclude in Section 4 by indicating related and future work. 2 Server-based Learning In this section we present the first case-study, where log data are stored and analyzed on a server (data are described in Section 2.2). Learning aims at predicting file access in order to prefetch files (see Section 2.1). The learning meth- ods NB and CRF are introduced shortly in Section 2.3 and Section 2.4, respectively. The results are shown in Section 2.5. 2.1 File-access pattern prediction A prediction of file-access patterns is of major importance for the performance and resource consumption of system software. For example, the Linux operating system uses a large “buffer cache” memory for disk blocks. If a requested disk block is already stored in the cache (cache hit), the op- erating system can deliver it to the application much faster and with less energy consumption than otherwise (cache miss). In order to manage the cache the operating system has to implement two strategies, block replacement and prefetching. The block replacement strategy is consulted upon a cache miss: a new block has to be inserted into the cache. If the cache is already full, the strategy has to decide which block has to be replaced. The most effective victim is the one with the longest forward distance, i.e. the block with the maximum difference between now and the time of the next access. This requires to know or guess the future sequence of cache access. The prefetching strategy proac- tively loads blocks from disk into the cache, even if they have not been requested by an application, yet. This of- ten pays off, because reading a bigger amount of blocks at once is more efficient than multiple read operations. How- ever, prefetching should only be performed if a block will be needed in the near future. For both strategies, block replacement and prefetching, a good prediction of future application behavior is crucial. Linux and other operating systems still use simple heuristic implementations of the buffer cache management strategies. For instance, the prefetching code in Linux [Bovet and Cesati, 2005] continuously monitors read op- erations. As long as a file is accessed sequentially the read ahead is increased. Certain upper and lower bounds restrict the risk of mispredictions. 
This heuristics has two flaws: • No prefetching is performed before the first read op- eration on a specific file, e.g., after “open”, or even earlier. • The strategy is based on assumptions on typical disk performance and buffer cache sizes, in general. How- ever, these assumptions might turn out to be wrong in certain application areas or for certain users. Prefetching based on machine learning avoids both prob- lems. Prefetching can already be performed when a file is opened. It only depends on the prediction that the file will be read. The prediction is based on empirical data and not on mere assumptions. If the usage data change, the model changes, as well. 2.2 System Call Data for Access Prediction We logged streams of system calls of type FILE, which consist of various typical sub-sequences, each starting with an open- and terminating with a close-call, like those shown in Figure 3.We collapsed such sub-sequences to one observation and assign the class label • full, if the opened file was read from the first seek (if any) to the end, • read, if the opened file was randomly accessed and • zero, if the opened file was not read after all. We propose the following generalization of obtained file- names. If a file is regular, we remove anything except the filename extension. Directory names are replaced by ”DIR”, except for paths starting with ”/tmp” – those are replaced by ”TEMP”. Any other filenames are replaced by ”OTHER”. This generalization of filenames yields good results in our experiments. Volatile information like thread-id, process-id, parent-id and system-call parameters is dropped, and consecutive observations are compound to one sequence if they belong to the same process. The re- sulting dataset consists of 673887 observations in 80661 sequences, a snippet1 is shown in Table 1. We used two feature sets for the given task. The first en- codes information about sequencing as features, resulting in 24 features, namely ft, ft−1, ft−2, ft−2/ft−1, ft−1/ft, 1The final dataset is available at: http://www-ai.cs.tu-dortmund.de/ PUBDOWNLOAD/MUSE2010 52 1,open,1812,179,178,201,200,eclipse,/etc/hosts,524288,438,7 : 361, full 2,read,1812,179,178,201,200,eclipse,/etc/hosts,4096,361 3,read,1812,179,178,201,200,eclipse,/etc/hosts,4096,0 4,close,1812,179,178,201,200,eclipse,/etc/hosts Figure 3: A sequence of system calls to read a file. The data layout is: timestamp, syscall, thread-id, process-id, parent, user, group, exec, file, parameters (optional) : read bytes, label (optional) user group exec file label 20005 10000 firefox- bin cookies.sqlite- journal zero 20005 10000 firefox- bin default zero 20005 10000 firefox- bin hosts full 20005 10000 firefox- bin hosts full 20005 10000 multiload- apple mtab full 10028 10000 kmail png zero Table 1: Snippet of the final dataset. predicted\true full zero read full 0 2 1 zero 5 0 4 read 4 2 0 Table 2: Cost matrix ft−2/ft−1/ft, with f ∈ {user, group, exec, file}. The second feature set simply uses two features exect−1/exect and filet−2/filet−1/filet as its only features. Errors in predicting the types of access result in different degrees of failure. Predicting a partial caching of a file, if just the rights of a file have to be changed, is not as prob- lematic as predicting a partial read if the file is to be read completely. Hence, we define a cost-matrix (see Table 2) for the evaluation of our approach. 2.3 Naive Bayes Classifier The Naive Bayes classifier [Hastie et al., 2003] assigns la- bels y ∈ Y to examples x ∈ X . 
Each example is a vector of m attributes written here as xi, where i = 1...m. The probability of a label given an example is according to the Bayes Theorem: p(Y |x1, x2, ..., xm) = p(Y )p (x1, x2, ..., xm|Y ) p (x1, x2, ..., xm) (1) Domingos and Pazzani [Domingos and Pazzani, 1996] rewrite eq. (1) and define the Simple Bayes Classifier (SBC): p(Y |x1, x2, ..., xm) = p(Y ) p (x1, x2, ..., xm) n∏ j=1 p (xj |Y ) (2) The classifier delivers the most probable class Y for a given example x = x1 . . . xm: argmax Y p(Y |x1, x2, ..., xm) = p(Y ) p (x1, x2, ..., xm) m∏ j=1 p (xj |Y ) (3) The term p (x1, x2, ..., xm) can be neglected in eq. (3) be- cause it is a constant for every class y ∈ Y . The decision for the most probable class y for a given example x just de- pends on p(Y ) and p (xi|Y ) for i = 1 . . .m. These proba- bilities can be calculated after one run on the training data. So, the training runtime is O(n), where n is the number of examples in the training set. The number of probabilities to be stored during training are |Y| + (∑mi=1 |Xj | ∗ |Y|), where |Y| is the number of classes and |Xi| is the number of different values of the ith attribute. The storage require- ments for the trained model are O(mn). It has often been shown that SBC or NBC perform quite well for many data mining tasks [Domingos and Pazzani, 1996; Huang et al., 2003; Frank and Asuncion, 2010]. 2.4 Linear-chain Conditional Random Fields Linear-chain Conditional Random Fields, introduced by Lafferty et al. [Lafferty et al., 2001], can be understood as discriminative, sequential version of Naive Bayes Classi- fiers. The conditional probability for an actual sequence of labels y1,y2, ...,ym, given a sequence of observa- tions x1,x2, ...,xm is modeled as an exponential family. The underlying assumption is that a class label at the cur- rent timestep t just depends on the label of its direct ances- tor, given the observation sequence. Dependency among the observations is not explicitly represented, which allows the use of rich, overlapping features. Equation 4 shows the model formulation of linear-chain CRF pλ (Y = y|X = x) = 1 Z (x) T∏ t=1 exp (∑ k λkfk (yt, yt−1,x) ) (4) with the observation-sequence dependent normalization factor Z (x) = ∑ y T∏ t=1 exp (∑ k λkfk (yt, yt−1,x) ) (5) The sufficient statistics or feature functions fk are most often binary indicator functions which evaluate to 1 only for a single combination of class label(s) and attribute value. The parameters λk can be regarded as weights or scores for this feature functions. In linear-chain CRF, each attribute value usually gets |Y| + |Y|2 parameters, that is one score per state-attribute pair as well as one score for every transition-attribute triple, which results in a total of∑m i=1 |Xi| (|Y|+ |Y|2)model parameters, where |Y| is the number of classes, m is the number of attributes and |Xi| is the number of different values of the ith attribute. Notice that the feature functions explicitly depend on the whole observation-sequence rather than on the attributes at time t. Hence, it is possible and common to involve attributes of preceding as well as following observations from the current sequence into the computation of the total score 53 exp ( ∑ k λkfk (yt, yt−1,x)) for the transition from yt−1 to yt given x. The parameters are usually estimated by the maximum- likelihood method, i.e., maximizing the conditional like- lihood (Eq. 
The parameters are usually estimated by the maximum-likelihood method, i.e., by maximizing the conditional likelihood (Eq. 6) with quasi-Newton [Malouf, 2002; Sha and Pereira, 2003; Nocedal, 1980] or stochastic gradient methods [Vishwanathan et al., 2006; Schraudolph and Graepel, 2002; Schraudolph et al., 2007].

L(\lambda) = \prod_{i=1}^{N} p_\lambda(Y = y^{(i)} \mid X = x^{(i)})   (6)

The actual class prediction for an unlabeled observation sequence is done by the Viterbi algorithm known from Hidden Markov Models [Sutton and McCallum, 2007; Rabiner, 1989]. Although CRF in general allow modeling arbitrary dependencies between the class labels, efficient exact inference is only possible for linear-chain CRF. This is no problem here, because they match the sequential structure of our system-call data presented in Section 2.2.

2.5 Results of Server-based Prediction
Comparing the prediction quality of the simple NB models and the more complex CRF models, surprisingly, the CRF are only slightly better when using the two best features (see Tables 3 and 5). CRF outperform NB when using all features (see Tables 4 and 6). These two findings indicate that the sequence information is not as important as we expected. Neither encoding the sequence into features nor applying an algorithm which is made for sequential information outperforms a simple model. The tables show that precision, recall, accuracy, and misclassification cost are quite homogeneous for CRF, but vary for NB. In particular, the precision of predicting "read" and the recall of class "zero" differ from the numbers for the other classes, respectively. This makes CRF more reliable.

Inspecting resource consumption, we stored models of the two methods for both feature sets and for various numbers of examples to show the practical storage needs of the methods. Table 9 presents the model sizes of the naive Bayes classifier on both feature sets and for various example set sizes. We used the popular open source data mining tool RapidMiner (http://www.rapidminer.com) for these experiments. Table 9 also shows the model sizes of CRF on both feature sets and various example set sizes. We used the open source CRF implementation CRF++ (http://crfpp.sourceforge.net/) with L2-regularization, σ = 1, and the L-BFGS optimizer in all CRF experiments. Obviously, the storage needs for a model produced by a NB classifier are lower than those for a CRF model. This is the price to be paid for the more reliable prediction quality; CRF do not scale up well. Considering training time, the picture becomes worse: Table 10 shows that training linear-chain or HMM-like CRF consumes orders of magnitude more time than NB.

predicted \ true      full      zero      read    prec.
full               1427467     19409      3427    98.43
zero                 12541   2469821     40258    97.91
read                 80872    217380   2467695    89.22
recall                93.86     91.25     98.26

Table 3: Result of the Naive Bayes classifier on the best two features, 10x10-fold cross-validated, accuracy: 94.45 ± 0.00, misclassification costs: 0.152 ± 0.001.

predicted \ true      full      zero      read    prec.
full               1426858     21562     22717    96.99
zero                 15392   2371009     97566    95.45
read                 78630    314039   2391097    85.89
recall                93.82     87.60     95.21

Table 4: Result of the Naive Bayes classifier on all 24 features, 10x10-fold cross-validated, accuracy: 91.84 ± 0.00, misclassification costs: 0.218 ± 0.002.
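The misclassification costs reported in the captions of Tables 3-8 combine a confusion matrix with the cost matrix of Table 2. A minimal sketch of this computation, under the assumption that the reported value is the average cost per example (consistent with, e.g., the 0.152 ± 0.001 reported in Table 3); the function and variable names are our own:

def average_misclassification_cost(confusion, cost):
    # confusion[pred][true]: number of examples of class 'true' predicted as 'pred'
    # cost[pred][true]: cost of that error, as given by the cost matrix of Table 2
    total_cost, total_count = 0.0, 0
    for pred, row in confusion.items():
        for true, count in row.items():
            total_cost += cost[pred][true] * count
            total_count += count
    return total_cost / total_count

# Cost matrix of Table 2 (predicted class -> true class -> cost):
COST = {"full": {"full": 0, "zero": 2, "read": 1},
        "zero": {"full": 5, "zero": 0, "read": 4},
        "read": {"full": 4, "zero": 2, "read": 0}}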
predicted \ true      full      zero      read    prec.
full               1446242      7123     29051    97.56
zero                 19452   2639097    133007    94.54
read                 55186     60390   2349322    95.31
recall                95.09     97.51     93.55

Table 5: Result of the HMM-like CRF on the best two features, 10x10-fold cross-validated, accuracy: 95.49 ± 0.00, misclassification costs: 0.150 ± 0.000.

predicted \ true      full      zero      read    prec.
full               1450147      8335     25629    97.71
zero                 14563   2639724    126403    94.93
read                 56170     58551   2359348    95.36
recall                95.35     97.53     93.95

Table 6: Result of the HMM-like CRF on all 24 features, 10x10-fold cross-validated, accuracy: 95.70 ± 0.00, misclassification costs: 0.143 ± 0.000.

predicted \ true      full      zero      read    prec.
full               1467440      4733      7503    99.17
zero                 10883   2659294    108340    95.71
read                 42557     42583   2395537    96.57
recall                96.49     98.25     95.39

Table 7: Result of the linear-chain CRF on the best two features, 10x10-fold cross-validated, accuracy: 96.79 ± 0.00, misclassification costs: 0.112 ± 0.000.

predicted \ true      full      zero      read    prec.
full               1468095      4117      5022    99.38
zero                 10306   2662966    107859    95.75
read                 42479     39527   2398499    96.69
recall                96.53     98.39     95.51

Table 8: Result of the linear-chain CRF on all 24 features, 10x10-fold cross-validated, accuracy: 96.89 ± 0.00, misclassification costs: 0.110 ± 0.000.

#Att. \ #Seq.        0    67k   135k   202k   270k   337k   404k   472k   539k   606k   674k
2  nB                2     78    100    118    132    143    154    161    169    176    181
24 nB                5    247    310    355    392    417    448    469    488    505    517
2  CRF++ (HMM)       5    247    366    458    490    512    569    592    614    634    649
24 CRF++ (HMM)      12    615    878   1102   1170   1216   1367   1420   1463   1521   1551
2  CRF++             6    523    776    978   1043   1089   1213   1260   1299   1345   1378
24 CRF++            19   1339   1914   2415   2559   2652   2988   3095   3184   3303   3365

Table 9: Storage needs (in kB) of the naive Bayes (nB), the HMM-like CRF (CRF++ (HMM)), and the linear-chain CRF (CRF++) classifier model on different numbers of sequences and attributes.

#Att. \ #Seq.        0     67k     135k     202k     270k     337k     404k     472k     539k     606k     674k
2  nB               <1      <1       <1       <1        1       <1       <1       <1       <1       <1       <1
24 nB               <1      <1       <1        1       <1        1        1        1        1        2        1
2  CRF++ (HMM)      <1    9.09    28.56    44.08     60.1    75.76   107.28   127.04   149.95   165.94    199.2
24 CRF++ (HMM)      <1   27.92     55.9   103.24   153.53   160.33    230.7   273.29   232.84   309.19   317.62
2  CRF++            <1   16.69    50.23    85.18   113.21   145.96   173.56   200.98   234.65   260.56   325.54
24 CRF++            <1   41.06   105.29   156.67   296.31   300.83   343.28   433.03   440.88   463.84   632.96

Table 10: Training time (in seconds) of the naive Bayes (nB), the HMM-like CRF (CRF++ (HMM)), and the linear-chain CRF (CRF++) classifier model on different numbers of sequences and attributes.

3 Device-based Learning
In this section, we present the second case study, where streams of log data are processed in order to store patterns of system use. The goal is to aggregate the streaming system data; a simple learning method might then use the aggregated data. The method of Hierarchical Heavy Hitters (HHH) is defined in Section 3.1. The log data are described in Section 3.2. For the comparison of different sets of HHH, we present a distance measure that allows for clustering or classifying sets of HHH. In addition to the quality of our HHH application, its resource consumption is presented in Section 3.3.

3.1 Hierarchical Heavy Hitters
The heavy hitter problem consists of finding all frequent elements and their frequency values in a data set. According to Cormode [Cormode et al., 2003], given a (multi)set S of size N and a threshold 0 < φ < 1, an element e is a heavy hitter if its frequency f(e) in S is not smaller than ⌊φN⌋. The set of heavy hitters is then HH = {e | f(e) ≥ ⌊φN⌋}.
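As a small illustration of the heavy hitter definition above (not of the streaming HHH algorithms themselves, which approximate it with bounded memory), an exact computation looks as follows; the names are our own:

from collections import Counter
from math import floor

def heavy_hitters(stream, phi):
    # All elements whose frequency f(e) in the (multi)set is at least floor(phi * N).
    counts = Counter(stream)
    threshold = floor(phi * len(stream))
    return {e: f for e, f in counts.items() if f >= threshold}

# Example: heavy_hitters(["open", "read", "read", "close", "read"], phi=0.4) -> {"read": 3}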
If the elements in S originate from a hierarchical domain D, one can state the following problem [Cormode et al., 2003]:

Definition 1 (HHH Problem) Given a (multi)set S of size N with elements e from a hierarchical domain D of height h, a threshold φ ∈ (0, 1), and an error parameter ε ∈ (0, φ), the Hierarchical Heavy Hitter Problem is that of identifying prefixes P ∈ D, and estimates f_p of their associated frequencies, on the first N consecutive elements S_N of S to satisfy the following conditions:
• accuracy: f_p^* − εN ≤ f_p ≤ f_p^*, where f_p^* is the true frequency of p in S_N.
• coverage: all prefixes q ∉ P satisfy φN > \sum f(e) : (e ⪯ q) ∧ (∄ p ∈ P : e ⪯ p).

Here, e ⪯ p means that element e is generalizable to p (or e = p). For the extended multi-dimensional heavy hitter problem introduced in [Cormode et al., 2004], elements can be multi-dimensional d-tuples of hierarchical values that originate from d different hierarchical domains with depth h_i, i = 1, ..., d. There exist two variants of algorithms for the calculation of multi-dimensional HHHs, Full Ancestry and Partial Ancestry, which we have both implemented. For a detailed description of these algorithms, see [Cormode et al., 2008].

3.2 System Call Data for HHH
The kernel of current Linux operating systems offers about 320 different types of system calls to developers. Having gathered all system calls made by several applications, we observed that about 99% of all calls belonged to one of the 54 different call types shown in Table 11. The functional categorization of system calls into five groups is due to [Silberschatz et al., 2010]. We focus on those calls only, since the remaining 266 call types are contained in only 1% of the data and therefore cannot be frequent.

FILE:  open, read, write, lseek, llseek, writev, fcntl, fcntl64, dup, dup2, dup3, close
COMM:  recvmsg, recv, send, sendmsg, sendfile, sendto, rt_sigaction, pipe, pipe2, socket, accept, accept4
PROC:  mmap2, munmap, brk, clone, fork, vfork, mprotect, unshare, execve, futex, nanosleep
INFO:  access, getdents, getdents64, clock_gettime, gettimeofday, time, uname, poll, fstat, fstat64, lstat, lstat64, stat, stat64, inotify_init, inotify_init1, readlink, select
DEV:   ioctl

Table 11: We focus on 54 system call types which are functionally categorized into five groups. FILE: file system operations, COMM: communication, PROC: process and memory management, INFO: informative calls, DEV: operations on devices.

HHHs can handle values that have a hierarchical structure. We have utilized this expressive power by representing system calls as tuples of up to three hierarchical feature values. Each value originates from a taxonomy (type, path, or sequence) that either can be derived dynamically from the data itself or has to be defined explicitly by the user. The groups introduced in Table 11 form the top level of the taxonomy for the hierarchical variable type (see Fig. 4). The socket call is a child of group COMM, and FILE is the parent of calls like open and fcntl64. Subtypes of system calls can be defined by considering the possible values of their parameters. For example, the fcntl64 call, which operates on file descriptors, has fd, cmd, and arg as its parameters. We have divided the 16 different nominal values of the cmd parameter into seven groups (notify, dflags, duplicate, sig, lock, fflags, and lease) that have become the children of the fcntl64 system call in our taxonomy (see Fig. 4). One may further divide fcntl64 calls of subtype fflags by the values F_SETFL and F_GETFL of the arg parameter. In the same way, we defined parents and children for each of the 54 call types and their parameters.
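The generalization relation e ⪯ p of Definition 1 can be read off such a taxonomy by walking from an element to the root. A minimal sketch with a small fragment of the type hierarchy (cf. Figure 4; the parent map is our own simplification and not the full taxonomy):

# Parent map for a fragment of the 'type' taxonomy; "*" denotes the root.
PARENT = {
    "FILE": "*", "COMM": "*", "PROC": "*", "INFO": "*", "DEV": "*",
    "open": "FILE", "fcntl64": "FILE", "socket": "COMM",
    "notify": "fcntl64", "fflags": "fcntl64", "lock": "fcntl64",
}

def generalizations(element):
    # All prefixes p with element <= p: the element itself, its parent, ..., the root.
    chain = [element]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

# generalizations("fflags") -> ["fflags", "fcntl64", "FILE", "*"]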
The path variable is filled whenever a system call accesses a file system path. Its hierarchy comes naturally with the given path hierarchy of the file system. The sequence variable expresses the temporal order of calls within a process: the directly preceding call is the highest level, less recent calls are at deeper levels of the hierarchy.

[Figure 4: Parts of the taxonomy we defined for the hierarchical variable type (root, the five groups, and, e.g., the children notify, dflags, duplicate, sig, lock, fflags, and lease of fcntl64).]

We collected system call data from eleven applications (like Firefox, Epiphany, NEdit, XEmacs) with the strace tool (version 4.5.17) under Ubuntu Linux (kernel 2.6.26, 32 bit). All child processes were monitored by using option -f of strace. For each application, we logged five times five minutes and five times ten minutes of system calls if they belonged to one of the 54 types shown in Table 11, resulting in a total of 110 log files comprising about 23 million lines (1.8 GB).

3.3 Resulting Aggregation through Hierarchical Heavy Hitters
We have implemented the Full Ancestry and Partial Ancestry variants of the HHH algorithm mentioned in Section 3.1. The code was integrated into the RapidMiner data mining tool. Regarding run-time, all experiments were done on a machine with an Intel Core 2 Duo E6300 processor with 2 GHz and 2 GB main memory.

Since we want to aggregate system call data on devices that are severely limited in processing power and available memory, measuring the resource usage of our algorithms was of paramount importance. Table 12 shows the run-time and memory consumption of the Full Ancestry and Partial Ancestry algorithms using only the type hierarchy, the type and path hierarchy, or the type, path, and sequence hierarchy. Minimum, maximum, and average values were calculated over the eleven applications, taking only the first of the ten gathered log files of each application into account.

              Memory                     Run-time                   Similarity
              Min    Max      Avg        Min   Max       Avg        Avg     Dev
FA   T        19     151      111        16    219       79         0.997   0.006
     TP       25     9,971    5,988      31    922       472        0.994   0.003
     TPS      736    73,403   48,820     78    14,422    6,569      0.987   0.008
PA   T        7      105      70         15    219       74         0.985   0.010
     TP       7      4,671    2,837      31    5,109     2,328      0.957   0.017
     TPS      141    18,058   10,547     78    150,781   74,342     0.921   0.026

Table 12: Memory consumption (number of stored tuples), run-time (milliseconds), and similarity to the exact solution of the Full Ancestry (FA) and Partial Ancestry (PA) algorithms (ε = 0.0005, φ = 0.002). Minimum (Min), maximum (Max), and average (Avg) values were calculated over measurements for the first log file of all eleven applications with varying dimensionality of the element tuples (T = type hierarchy, P = path hierarchy, S = sequence hierarchy).

Memory consumption and run-time increase with the dimensionality of the elements, while at the same time approximation quality decreases. Quality is measured as similarity to the exact solution. Full Ancestry has a higher approximation quality in general. The results correspond to observations made by Cormode and are probably due to the fact that Partial Ancestry outputs bigger HHH sets, which was the case in our experiments, too. Note that approximation quality can always be increased by setting parameter ε to a smaller value, at the expense of a longer run-time. Even for three-dimensional elements, memory consumption is quite low regarding the number of stored tuples.
The largest number of tuples, 73,403, equates to only a few hundred kilobytes in main memory. The longest run-time of 150,781 ms for Partial Ancestry in three dimensions relates to the size of the biggest log file (application Rhythmbox).

Figure 5 shows the behaviour of our algorithms on the biggest log file (application Rhythmbox) for three dimensions with varying ε and constant φ. Memory consumption and quality decrease with increasing ε, while the run-time increases. So the most important trade-off involved here is weighting memory consumption against approximation quality; the run-time is only linearly affected by parameter ε. Again, Full Ancestry shows a better approximation quality in general.

Classification results
For the 110 log files of all applications, we determined the HHHs, resulting in sets of frequent tuples of hierarchical values. Interpreting each HHH set as an example of application behaviour, we wanted to answer the question whether the profiles can be separated by a classifier. So we estimated the expected classification performance by a leave-one-out validation for kNN.

To this end, we needed to define a distance measure for the profiles determined by HHH algorithms. The data structures of HHH algorithms contain a small subset of prefixes of the stream elements. The estimated frequencies f_p are calculated from such a data structure by the output method and compared to φ, thereby generating a HHH set. The similarity measure DSM operates not on the HHH sets, but directly on the internal data structures D_1, D_2 of two HHH algorithms:

sim(D_1, D_2) = \frac{\sum_{p \in P_1 \cap P_2} contrib_{DSM}(p)}{|P_1 \cup P_2|}.

Let f_p^i be the estimated frequency of prefix p for data structure D_i, as normally calculated by the HHH output method. The contribution of individual prefixes to the overall similarity can then be defined as

contrib_{DSM}(p) = \frac{2 \cdot \min(f_p^1, f_p^2)}{\min(f_p^1, f_p^2) + \max(f_p^1, f_p^2)}.

The similarity measure defined in this way is independent of the choice of φ, as no HHH sets need to be calculated.

The classification errors for different values of k, hierarchies, and distance measures are shown in Table 13. The new DSM distance measure, which is independent of parameter φ, shows the lowest classification error in all validation experiments. As a baseline, we also determined the relative frequencies (TF, term frequencies) of call types per log file and classified them using kNN (with Euclidean distance). The error for profiling by HHH sets is significantly lower than for the baseline.

        T                 TS
k       DSM     TF        DSM     TF
3       10.3    17.0      7.7     17.0
5       12.7    18.7      8.7     18.7
7       14.0    21.7      8.7     21.7
9       14.0    21.0      9.0     21.0

Table 13: Results for kNN (k = 3, 5, 7, 9), ε = 0.0005, φ = 0.002, and distance measures DSM and TF, when only the type hierarchy (T) or the type and sequence hierarchy together (TS) are used.
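A minimal sketch of the DSM similarity defined above, operating directly on the estimated prefix frequencies of two HHH data structures (the dictionary-based encoding and the names are our own):

def dsm_similarity(freqs1, freqs2):
    # freqs1, freqs2: dicts mapping prefixes p to the estimated frequencies f_p
    # produced by the HHH output method for the data structures D1 and D2.
    union = set(freqs1) | set(freqs2)
    if not union:
        return 0.0
    total = 0.0
    for p in set(freqs1) & set(freqs2):
        lo, hi = min(freqs1[p], freqs2[p]), max(freqs1[p], freqs2[p])
        if lo + hi > 0:
            total += 2.0 * lo / (lo + hi)   # contrib_DSM(p)
    return total / len(union)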
4 Conclusion
Server-based and device-based learning have been investigated with regard to resource constraints, in particular memory consumption. Aggregation using HHH worked successfully for the classification of applications. Further work will exploit HHH aggregation for other learning tasks and inspect other data streaming algorithms. Concerning server-based learning, we may now answer the questions from the introduction: are structural models required for the prediction of user behavior on the basis of system calls, or do simpler models such as Naive Bayes suffice? Should the sequential nature of system calls be taken into account by the algorithm? Or is it sufficient to encode the sequences into the features? Or should features as well as the algorithm be capable of explicitly addressing sequences? We have compared CRF and NB with respect to their model quality, memory consumption, and runtime. Neither encoding the sequence into features nor applying an algorithm which is made for sequential information (i.e., CRF) outperforms a simple model (i.e., NB).

This is in contrast to studies on intrusion detection, where it was shown to be advantageous to take the structure of system calls into account, utilizing Conditional Random Fields (CRF) [Gupta et al., 2007] and special kernel functions to measure the similarity of sequences [Tian et al., 2007]. Structured models in terms of special tree kernel functions also outperformed n-gram representations when detecting malicious SQL queries [Bockermann et al., 2009]. Possibly, for prefetching strategies, the temporal order of system calls is not as important as we expected it to be. In the near future, the resulting improvements in terms of cache hit rate and file operation latencies will be evaluated systematically based on a cache simulator and by modifying the Linux kernel.

Given regular processors, CRF are only applicable in server-based learning. Possibly, the integration of special processors into devices and a massively parallel training algorithm could speed up CRF for device-based learning. Further work will implement CRF on a GPGPU (general purpose graphics processing unit). GPGPUs will soon be used by mobile devices, and it has been shown that their energy efficiency is advantageous [Timm et al., 2010].

References
[Bockermann et al., 2009] Christian Bockermann, Martin Apel, and Michael Meier. Learning SQL for database intrusion detection using context-sensitive modelling. In Proc. 6th Detection of Intrusions and Malware, and Vulnerability Assessment, pages 196–205. Springer, 2009.
[Bovet and Cesati, 2005] Daniel Bovet and Marco Cesati. Understanding the Linux Kernel, Third Edition. O'Reilly & Associates, Inc., Sebastopol, CA, USA, 2005.
[Cantrill et al., 2004] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic instrumentation of production systems. In Proc. of USENIX ATEC '04, Berkeley, USA, 2004. USENIX.
[Cormode et al., 2003] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 464–475. VLDB Endowment, 2003.
[Cormode et al., 2004] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 155–166, New York, NY, USA, 2004. ACM.
[Cormode et al., 2008] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Trans. Knowl. Discov. Data, 1(4):1–48, 2008.
[Domingos and Pazzani, 1996] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Machine Learning, pages 105–112. Morgan Kaufmann, 1996.
[Frank and Asuncion, 2010] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[Gupta et al., 2007] K. Gupta, B. Nath, and K. Ramamohanarao. Conditional random fields for intrusion detection. In 21st Intl. Conf. on Adv. Information Netw. and Appl., pages 203–208, 2007.
[Hastie et al., 2003] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, corrected edition, July 2003.
[Huang et al., 2003] Jin Huang, Jingjing Lu, and Charles X. Ling. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. In Third IEEE International Conference on Data Mining, ICDM 2003, pages 553–556. IEEE Computer Society, 2003.
[Lafferty et al., 2001] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289, 2001.
[Lohmann et al., 2009] Daniel Lohmann, Wanja Hofer, Wolfgang Schröder-Preikschat, Jochen Streicher, and Olaf Spinczyk. CiAO: An aspect-oriented operating-system family for resource-constrained embedded systems. In Proc. of USENIX ATEC, Berkeley, USA, 2009. USENIX.
[Malouf, 2002] Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation. In COLING-02: Proceedings of the 6th Conference on Natural Language Learning, pages 1–7, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[Figure 5: Memory consumption (a, b), run-time (c, d), and similarity to the exact solution (e, f) of the HHH algorithms (Full Ancestry and Partial Ancestry, three-dimensional) with varying ε, φ = 0.001, on the biggest log file of application Rhythmbox.]

[Nocedal, 1980] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
[Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[Schraudolph and Graepel, 2002] Nicol N. Schraudolph and Thore Graepel. Conjugate directions for stochastic gradient descent. In ICANN '02: Proceedings of the International Conference on Artificial Neural Networks, pages 1351–1358, London, UK, 2002. Springer-Verlag.
[Schraudolph et al., 2007] Nicol N. Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Marina Meila and Xiaotong Shen, editors, Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (AIstats), volume 2 of Workshop and Conference Proceedings, pages 436–443, San Juan, Puerto Rico, 2007.
[Sha and Pereira, 2003] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 134–141, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[Silberschatz et al., 2010] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. Wiley Publishing, 2010.
[Sutton and McCallum, 2007] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2007.
[Tartler et al., ] Reinhard Tartler, Daniel Lohmann, Wolfgang Schröder-Preikschat, and Olaf Spinczyk. Dynamic AspectC++: Generic advice at any time. In The 8th Int. Conf. on Software Methodologies, Tools and Techniques, Prague. IOS Press. (to appear).
[Tian et al., 2007] S. Tian, S. Mu, and C. Yin. Sequence-similarity kernels for SVMs to detect anomalies in system calls. Neurocomput., 70(4–6):859–866, 2007.
[Timm et al., 2010] C. Timm, A. Gelenberg, F. Weichert, and P. Marwedel. Reducing the Energy Consumption of Embedded Systems by Integrating General Purpose GPUs. Technical Report 829, Technische Universität Dortmund, Fakultät für Informatik, 2010.
[Vishwanathan et al., 2006] S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods.
In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 969–976, New York, NY, USA, 2006. ACM.

Workflow Analysis using Graph Kernels
Natalja Friesen and Stefan Rüping
Fraunhofer IAIS, Schloss Birlinghoven, St. Augustin, Germany
{natalja.friesen,stefan.rueping}@iais.fraunhofer.de

Abstract
Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. Since the construction of a workflow for a specific task can become quite complex, efforts are currently underway to increase the re-use of workflows through the implementation of specialized workflow repositories. While existing methods to exploit the knowledge in these repositories usually consider workflows as an atomic entity, our work is based on the fact that workflows can naturally be viewed as graphs. Hence, in this paper we investigate the use of graph kernels for the problems of workflow discovery, workflow recommendation, and workflow pattern extraction, paying special attention to the typical situation of few labeled and many unlabeled workflows. To empirically demonstrate the feasibility of our approach, we investigate a dataset of bioinformatics workflows retrieved from the website myexperiment.org. (This paper also appears at the Workshop on Service-Oriented Knowledge Discovery, SoKD 2010.)

1 Introduction
Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. A workflow is basically a description of the order in which a set of services have to be called, and with which input, in order to solve a given task. Since the construction of a workflow for a specific task can become quite complex, efforts are currently underway to increase the re-use of workflows through the implementation of specialized workflow repositories.

Driven by specific applications, a large collection of workflow systems has been prototyped, such as Taverna [Oinn et al., 2004] or Triana [Taylor et al., 2006]. As high numbers of workflows can be generated and stored relatively easily, it becomes increasingly hard to keep an overview of the available workflows. Workflow repositories and websites such as myexperiment.org tackle this problem by offering the research community the possibility to publish and exchange complete workflows. An even higher degree of integration has been described in the idea of developing a Virtual Research Environment (VRE, [Fraser, 2005]).

Due to the complexity of managing a large repository of workflows, data mining approaches are needed to support the user in making good use of the knowledge that is encoded in these workflows. In order to improve the flexibility of a workflow system, a number of data mining tasks can be defined:

Workflow recommendation: Compute a ranking of the available workflows with respect to their interestingness to the user for a given task. As it is hard to formally model the user's task and his interest in a workflow, one can also define the task of finding a measure of similarity on workflows. Given a (partial) workflow for the task the user is interested in, the most similar workflows are then recommended to the user.

Metadata extraction: Given a workflow (and possibly partial metadata), infer the metadata that describes the workflow best.
As most approaches for searching and organizing workflows are based on descriptive metadata, this task can be seen as the automation of the extraction of workflow semantics.

Pattern extraction: Given a set of workflows, extract a set of sub-patterns that are characteristic for these workflows. A practical purpose of these patterns is to serve as building blocks for new workflows. In particular, given several sets of workflows, one can also define the task of extracting the most discriminative patterns, i.e., patterns that are characteristic for one group but not the others.

Several approaches to these problems exist in the data mining literature, in particular on machine learning for recommender systems [Goderis, 2008]. However, existing methods usually consider a workflow as an atomic entity, using workflow metadata such as its usage history, textual descriptions (in particular tags), or user-generated quality labels as descriptive attributes. While these approaches can deliver high quality results, they are limited by the fact that all these attributes require either a high user effort to describe the workflow (to use text mining techniques), or frequent use of each workflow by many different users (to mine for correlations).

Figure 1 shows the evolution of information about a workflow over its lifetime. In the construction phase, mainly technical information about the workflow itself is generated. Possibly starting from a workflow pattern, several versions of a workflow are generated and tested. In the submission phase, the workflow is made public for possible re-use. Here, the owner gives a more or less detailed description of the purpose of the workflow and of the task that it is supposed to achieve. In the final phase, the re-use of the workflow by different users begins, generating information about workflow usage, such as correlations with the usage of other workflows. Users possibly vote on the quality of the workflow.

[Figure 1: Workflow data: over the lifetime of a workflow (horizontal axis: construction, submission, re-use), information on different levels of generality (vertical axis: workflow, task, quality) is generated.]

In this paper we are interested in supporting the user in constructing the workflow and reducing the manual effort of workflow tagging. The reason for the focus on the early phases of workflow construction is that in practice it can be observed that users are often reluctant to put too much effort into describing a workflow; they are usually only interested in using the workflow system as a means to get their work done. A second aspect to be considered is that without proper means to discover existing workflows for re-use, it will be hard to receive enough usage information on a new workflow to start up a correlation-based recommendation in the first place.

To address these problems, we have opted to investigate solutions to the previously described data mining tasks that can be applied in the common situation of many unlabeled workflows, using only the workflow description itself and no metadata. Our work is based on the fact that workflows can be viewed as graphs. We will demonstrate that by the use of graph kernels it is possible to effectively extract workflow semantics and use this knowledge for the problems of workflow recommendation and metadata extraction.
The purpose of this paper is to answer the following questions:

Q1: How good are graph kernels at performing the task of workflow recommendation without explicit user input? We will present an approach that is based on exploiting workflow similarity.

Q2: Can appropriate metadata about a workflow be extracted from the workflow itself? What can we infer about the semantics of a workflow and its key characteristics? In particular, we will investigate the task of tagging a workflow with a set of user-defined keywords.

Q3: How well does graph mining perform at a descriptive approach to workflow analysis, namely the extraction of meaningful graph patterns?

The remainder of the paper is structured as follows: Next, we discuss related work in the area of workflow systems. In Section 3, we give a detailed discussion of the representation of workflows and the associated metadata. Section 4 presents the approach of using graph kernels for workflow analysis. The approach is evaluated on four distinct learning tasks on a dataset of bioinformatics workflows retrieved from the website http://myexperiment.org in Section 5. Section 6 concludes.

2 Related Work
Since workflow systems are getting more complex, the development of effective discovery techniques for this field has been addressed by many researchers in recent years. Public repositories that enable the sharing of workflows are widely used both in business and in scientific communities. While first steps toward supporting the user have been made, there is still a need to improve the effectiveness of discovery methods and to support the user in navigating the space of available workflows. A detailed overview of different approaches for workflow discovery is given by Goderis [Goderis, 2008].

Most approaches are based on simple search functionalities and consider a workflow as an atomic entity. Searching over workflow annotations like titles and textual descriptions, or discovery on the basis of user profiles, belongs to the basic capabilities of academic repositories such as myExperiment [Roure, 2009], BioWep (http://bioinformatics.istge.it/biowep/), and Kepler (https://kepler-project.org/), or of commercial systems like Infosense and Pipeline Pilot. In [Goderis et al., 2009] a detailed study about current practices in workflow sharing, re-using, and retrieval is presented. To summarize, the need to take structural properties of workflows into account in the retrieval process was underlined by several users. The authors demonstrate that existing techniques are not sufficient and that there is still a need for effective discovery tools. In [Goderis et al., 2006], retrieval techniques and methods for ranking discovered workflows based on graph-subisomorphism matching are presented. Corrales et al. [Corrales et al., 2006] propose a method for calculating the structural similarity of two BPEL (Business Process Execution Language) workflows represented by graphs. It is based on error-correcting graph subisomorphism detection.

Apart from workflow sharing and retrieval, the design of new workflows is an immense challenge to users of workflow systems. It is both time-consuming and error-prone, as there is a great diversity of choices regarding services, parameters, and their interconnections. It requires the researcher to have specific knowledge both in his research area and in the use of the workflow system. Consequently, it is preferable for a researcher not to start from scratch, but to receive assistance in the creation of a new workflow.
A good way to implement this assistance is to reuse or repurpose existing workflows or workflow patterns (i.e., more generic fragments of workflows). An example of workflow re-use is given in [Goderis, 2008], where a workflow to identify genes involved in tolerance to Trypanosomiasis in East African cattle was reused successfully by another scientist to identify the biological pathways implicated in the ability of mice to expel the Trichuris Muris parasite. In [Goderis et al., 2005] it is argued that designing new workflows by reusing and repurposing previous workflows or workflow patterns has the following advantages:
• Reduction of workflow authoring time
• Improved quality through shared workflow development
• Improved experimental provenance through reuse of established and validated workflows
• Avoidance of workflow redundancy

While there has been some research comparing workflow patterns in a number of commercially available workflow management systems [Van Der Aalst et al., 2003] or identifying patterns that describe the behavior of business processes [White, March 2004], to the best of our knowledge there exists no work to automatically extract patterns. A pattern mining method for business workflows based on the calculation of support values is presented in [Thom et al., 2007]. However, the set of patterns that was used was derived manually based on an extensive literature study.

3 Workflows
A workflow is a way to formalize and structure complex data analysis experiments. Scientific workflows can be described as a sequence of computation steps, together with predefined input and output, that arise in scientific problem-solving. Such a definition of workflows enables sharing analysis knowledge within scientific communities in a convenient way. Since we want to provide the user with suggestions about possibly interesting workflows, we need to be able to compare workflows available in existing VREs with the one that the user has. That implies that workflows need a consistent representation.

3.1 Example: myExperiment.org
We consider the discovery of similar workflows in the context of a specific VRE called myExperiment [Roure et al., 2008]. MyExperiment has been developed to support sharing of scientific objects associated with an experiment. It is a collaborative environment where scientists can safely publish their workflows and experiment plans, share them with groups, and find those of others. MyExperiment is designed to make it easy for scientists to contribute to a pool of scientific workflows, build communities, and form relationships. Its goal is to enable scientists to share and reuse workflows, reduce the time-to-experiment, and avoid reinvention.

Figure 2 shows a workflow view in myExperiment. Each stored workflow is created by a specific user, is associated with a workflow graph, and contains metadata and certain statistics such as the number of downloads or the rating. We split all available information about a workflow into four different groups: the workflow graph, textual data, user information, and workflow statistics. Next we will characterize each group in more detail. The graph properties of workflows are described in detail in Section 4.

3.2 Textual Data
Each workflow in myExperiment has a title and a description text and contains information about the creator and the date of creation.
Furthermore, the associated tags annotate the workflow with several keywords that facilitate searching for workflows and provide more precise results. Additionally, for workflows in Taverna format, myExperiment displays visual previews and detailed information about single workflow components, such as the type of inputs and outputs and the number and type of certain processors.

3.3 User Information
MyExperiment was created not only as an environment for sharing workflows, but also as a social infrastructure for researchers. The social component is realized by the registration of users and by allowing them to create profiles with different kinds of personal information and details about their work and professional life. The members of myExperiment can form complex relationships with other members, such as creating or joining user groups or giving credit to others. All this information can be used in order to find groups of users having similar research interests or working in related projects. In the end, this type of information can be used to generate the well-known correlation-based recommendations of the type "users who liked this workflow also liked the following workflows...".

3.4 Workflow Statistics
As statistic data we consider information that changes over time, such as the number of views or downloads or the average rating. Statistic data can be very useful for providing a user with a workflow he is likely to be interested in. As we do not have direct information about user preferences, some of the statistics data, e.g., the number of downloads or the rating, can be considered as a kind of quality measure.

4 A Graph Mining Approach to Workflow Analysis
The characterization of a workflow by metadata alone is challenging because none of these features gives an insight into the underlying sub-structures of the workflow. It is clear that users do not always create a new workflow from scratch, but most likely re-use old components and sub-workflows. Hence, knowledge of sub-structures is important information to characterize a workflow completely. The common approach to represent objects for a learning problem is to describe them as vectors in a feature space. However, when we handle objects that have important sub-structures, such as workflows, the design of a suitable feature space is not trivial. For this reason, we opt to follow a graph mining approach.

4.1 Frequent Subgraphs
Frequent subgraph discovery has received intense and still growing attention, since it has a wide range of application areas. Frequently occurring subgraphs in a large set of graphs could represent important motifs in the data. Given a set of graphs 𝒢, the support S(G) of a graph G is defined as the fraction of graphs in 𝒢 in which G occurs. The problem of finding frequent patterns is then defined as follows: given a set of graphs 𝒢 and a minimum support S_min, we want to find all connected subgraphs G that occur frequently enough (i.e., S(G) ≥ S_min) over the entire set of graphs. The output of the discovery process may contain a large number of such patterns.
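A minimal sketch of the support computation just described (our own illustration; the subgraph-containment test is assumed to be supplied by a frequent-subgraph miner and is not implemented here):

def support(pattern, graphs, contains):
    # S(G): fraction of database graphs that contain the candidate subgraph.
    # 'contains(graph, pattern)' is an assumed subgraph-containment predicate.
    return sum(1 for g in graphs if contains(g, pattern)) / len(graphs)

def frequent_subgraphs(candidates, graphs, contains, s_min):
    # All candidate subgraphs whose support reaches the minimum support s_min.
    return [p for p in candidates if support(p, graphs, contains) >= s_min]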
[Figure 2: MyExperiment: a workflow with its associated metadata.]

4.2 Graph Kernels
Graph kernels, as originally proposed by [Gaertner et al., 2003; Kashima and Koyanagi, 2002], provide a general framework for handling graph data structures with kernel methods. Different approaches for defining graph kernels exist. A popular representation of graphs that is used, for example, in protein modeling and drug screening are kernels based on cyclic patterns [Horváth et al., 2004]. However, these are not applicable to workflow data, as workflows are by definition acyclic (because an edge between services A and B represents the relation "A must finish before B can start").

To adequately represent the decomposition of workflows into functional substructures, we apply the following approach: the set of graphs is searched for substructures (subtrees) that occur in at least a given percentage (support) of all graphs. Then, the feature vector is composed of the weighted counts of the substructures. The substructures are sequences of labeled vertices that were produced by graph traversal. The length of a substructure is equal to the number of vertices in it. This family of kernels is called label sequence kernels. The main difference among the kernels lies in how graphs are traversed and how weights are involved in computing a kernel. According to the extracted substructures, these are kernels based on walks, trees, or cycles. In our work we used the kernels based on random walks with exponential weights proposed by Gärtner et al. [Gaertner et al., 2003]. Since workflows are directed acyclic graphs, in our special case the hardness results of [Gaertner et al., 2003] (such random walks cannot be enumerated in the general case) no longer hold, and we can actually enumerate all walks. This allows us to explicitly generate the feature space representation of the kernel by defining the attribute value for every substructure (walk), and thus to generate the explicit finite set of features directly. For each substructure s in the set of graphs, let k be the length of the substructure. Then, the attribute λ_s is defined as

\lambda_s = \frac{\beta^k}{k!}   (1)

if the graph contains the substructure s, and λ_s = 0 otherwise. Here β is a parameter that we have optimized by cross-validation. The graph kernel between graphs G_1 and G_2 is now defined as

k(G_1, G_2) = \sum_{s \in sub(G_1) \cap sub(G_2)} \lambda_s

where the sum runs over all substructures s that are present in both G_1 and G_2. Note that if for a finite set of d substructures s_1, ..., s_d we define

\lambda(G) = \big(\sqrt{\lambda_{s_1}}\, 1_{\{s_1 \in G\}}, \ldots, \sqrt{\lambda_{s_d}}\, 1_{\{s_d \in G\}}\big),

this is equivalent to k(G_1, G_2) = λ(G_1)^T λ(G_2). A very important advantage of the graph kernel approach for the discovery task is that distinct substructures can provide an insight into the specific behavior of the workflow.

4.3 Graph Representation of Workflows
A workflow can be formalized as a directed, acyclic, labeled graph. The workflow graph has two kinds of nodes: regular nodes representing the computation operations and nodes defining the input/output data structure. A set of edges shows the information and control flow between the nodes. More formally, a workflow graph can be defined as a tuple W = (N, T), where
• N = {C, I, O},
• C = finite set of computation operations,
• I/O = finite sets of inputs and outputs,
• T ⊆ N × N = finite set of transitions defining the control flow.

Labeled graphs contain an additional source of information. There are several alternatives to obtain node labels. On the one hand, users often annotate single workflow components with a combination of words or abbreviations. On the other hand, each component within a workflow system has a signature and an identifier associated with it, e.g., in the web-service WSDL format. User-created labels suffer from subjectivity and diversity, e.g., the same node representing the same computational operation can be labeled in very different ways. The first alternative again assumes some type of user input, so we opt to use the second alternative.
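Because workflow graphs are acyclic, the walks (label sequences) can be enumerated explicitly, as noted above. The following minimal Python sketch (our own illustration, with an additional length cap for brevity; the graph encoding and all names are assumptions) computes the weights λ_s = β^k / k! of Eq. (1) and sums them over the walks shared by two graphs:

from math import factorial

def label_walks(adj, labels, max_len=4):
    # Enumerate the label sequences of all directed walks with up to max_len
    # vertices. adj: vertex -> list of successor vertices; labels: vertex -> label.
    found = set()
    def extend(vertex, seq):
        seq = seq + (labels[vertex],)
        found.add(seq)
        if len(seq) < max_len:
            for succ in adj.get(vertex, []):
                extend(succ, seq)
    for vertex in labels:
        extend(vertex, ())
    return found

def workflow_kernel(graph1, graph2, beta=1.0, max_len=4):
    # k(G1, G2) = sum of lambda_s over substructures s shared by both graphs,
    # with lambda_s = beta**k / k! and k the length of s.
    # graph = (adj, labels); a sketch under our own graph encoding.
    walks1 = label_walks(*graph1, max_len=max_len)
    walks2 = label_walks(*graph2, max_len=max_len)
    return sum(beta ** len(s) / factorial(len(s)) for s in walks1 & walks2)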
Figure 3 shows an example of such a transformation obtained for a Taverna workflow [Oinn et al., 2004]. While the left picture shows the user-annotated components, the right picture presents the workflow graph at the next abstraction level.

[Figure 3: Transformation of a Taverna workflow to the workflow graph.]

5 Evaluation
In this section we illustrate the use of workflow structure, and of graph kernels in particular, for workflow discovery and pattern extraction. We evaluate the results on a real-world dataset of Taverna workflows. However, the same approach can be applied to other workflow systems, as long as we can obtain meaningful, consistent labels from the workflows.

5.1 Dataset
For the purposes of this evaluation we used a corpus of 300 real-world bioinformatics workflows retrieved from myExperiment [Roure et al., 2008]. We chose to restrict ourselves to workflows that were created in the Taverna workbench [Oinn et al., 2004] in order to simplify the formatting of workflows. Since the application area of myExperiment is restricted to bioinformatics, we may be sure that there are sets of similar workflows there. User feedback about the similarity of workflow pairs is missing. Hence, we used semantic information to obtain workflow similarity. We made the assumption that workflows targeting the same tasks are similar. Under this assumption we used the cosine similarity of the vectors of tags assigned to the workflows as a proxy for the true similarity. An optimization over the number of clusters resulted in the 5 groups shown in Table 1. These tags indeed impose a clear structuring with few overlaps on the workflows.

5.2 Workflow Recommendation
In this section, we address Question Q1: How good are graph kernels at performing the task of workflow recommendation without explicit user input? The goal is to retrieve workflows that are "close enough" to a user's context. Therefore, we need to be able to compare workflows available in existing VREs with the user's one. As similarity measure we use the graph kernel, which is equivalent to the cosine distance between the feature vectors consisting of the weighted counts of the frequent subgraphs.

We compare our approach based on graph kernels to several techniques representing the current state of the art [Goderis et al., 2006]: matching of workflow graphs based on the size of the maximal common subgraph (MCS) and a method that considers a workflow as a bag of services. In addition to these techniques we also consider a standard text mining approach, whose main idea is that workflows are documents in XML format. The similarity of a workflow pair is then calculated as the cosine distance between the respective word vectors.

In our experiment we predict whether two workflows belong to the same cluster. Table 2 summarizes the average performances of a leave-one-out evaluation for the four approaches. It can be seen that graph kernels clearly outperform all other approaches in accuracy and recall. For precision, MCS performs best, however, at the cost of a minimal recall. The precision of graph kernels ranks second and is close to the value of MCS. Therefore, we conclude that graph kernels are suitable for the task of workflow recommendation based only on graph structure, without explicit user input.

5.3 Workflow Tagging
We are now interested in Question Q2, the extraction of appropriate metadata from workflows.
As a prototypical piece of metadata, we investigate user-defined tags. 20 tags were selected that occur in at least 3% of all workflows. We use the tags as proxies that represent the real-world task that a workflow can perform. The selected tags yield 20 different classification problems. For each tag we would like to predict whether it describes a given workflow. To do that, we utilize the graph kernel representation of the workflows. We tested two algorithms: SVM and k-Nearest Neighbor. Table 3 shows the results of the tag prediction, evaluated by 2-fold cross validation over the 20 keywords. It can be seen that an SVM with graph kernels can predict the selected tags with high AUC and precision, while a Nearest Neighbor approach using graph kernels to define the distance achieves a higher recall. We can conclude that the graph representation of workflows contains enough information to predict appropriate metadata.

5.4 Pattern Extraction

Finally, we investigate Question Q3, which deals with the more descriptive task of extracting meaningful patterns from sets of workflows that are helpful in the construction of new workflows. We address the issue of extracting patterns that are particularly important within a group of similar workflows in several steps. First, we use a linear SVM to build a classification model based on the graph kernels. This model separates all workflows which belong to the same group from the workflows of the other groups. Then we search for features with high weight values, which the model considers as important. We performed such pattern extraction for each workflow group in turn. A 10-fold evaluation shows that this classification can be achieved with high accuracy, with values ranging between 81.3% and 94.7%, depending on the class. However, we are more interested in the most significant patterns, which we determine based on the weight that was assigned by the SVM (taking the different standard deviations into account). Examples of such patterns are described below.

Figure 4 shows an example of a workflow pattern together with the same pattern inside one of the workflows in which it occurs. It was considered important for classifying workflows from group 2, which consists of workflows using the BLAST algorithm to calculate sequence similarities. The presented pattern is a sequence of components that are needed to run a BLAST service. This example shows that graph kernels can be used to extract useful patterns. These patterns can then be recommended to the user during the creation of a new workflow.

Group  Size   Most frequent tags                                                                              Description
1      30%    localworker, example, mygrid                                                                    Workflows using local scripts
2      29%    bioinformatics, sequence, protein, BLAST, alignment, similarity, structure, search, retrieval   Sequence similarity search using the BLAST algorithm
3      24%    benchmarks                                                                                      Benchmark workflows
4      6.7%   AIDA, BioAID, text mining, bioassist, demo, biorange                                            Text mining on biomedical texts using the AIDA toolbox and BioAID web services
5      6.3%   Pathway, microarray, kegg                                                                       Molecular pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG)

Table 1: Characterization of workflow groups derived by clustering.

Method            Accuracy      Precision     Recall
Graph Kernels     81.2 ± 10.0   71.9 ± 22.0   38.3 ± 21.1
MCS               73.9 ± 9.3    73.5 ± 24.7    4.8 ± 27.4
Bags of services  73.5 ± 10.3   15.5 ± 20.6    3.4 ± 30.1
Text Mining       77.8 ± 8.31   67.2 ± 21.5   31.2 ± 25.8

Table 2: Performance of workflow discovery.

6 Conclusions

Workflow enacting systems have become a popular tool for the easy orchestration of complex data processing tasks.
However, the design and management of workflows is a complex task. Machine learning techniques have the potential to significantly simplify this work for the user. In this paper, we have discussed the usage of graph kernels for the analysis of workflow data. We argue that graph kernels are a good tool for analyzing workflow data in the practically important situation where no metadata is available. This is because the graph kernel approach takes the decomposition of a workflow into its important substructures into account, while allowing a flexible integration of the information contained in these substructures into several learning algorithms. We have evaluated the use of graph kernels in the fields of workflow similarity prediction, metadata extraction, and pattern extraction. A comparison of graph-based workflow analysis with metadata-based workflow analysis in the field of workflow quality modeling showed that metadata-based approaches outperform graph-based approaches in this application. However, it is important to recognize that the goal of the graph-based approach is not to replace the metadata-based approaches, but to serve as an extension when little or no metadata is available.

Method             AUC           Precision     Recall
Nearest Neighbors  0.54 ± 0.18   0.51 ± 0.21   0.58 ± 0.19
SVM                0.85 ± 0.10   0.84 ± 0.24   0.38 ± 0.29

Table 3: Workflow tagging based on graph kernels.

Figure 4: Example of workflow graph.

References

[Corrales et al., 2006] Juan Carlos Corrales, Daniela Grigori, and Mokrane Bouzeghoub. BPEL processes matchmaking for service discovery. In Proc. CoopIS 2006, Lecture Notes in Computer Science 4275, pages 237–254. Springer, 2006.
[Fraser, 2005] M. Fraser. Virtual Research Environments: Overview and Activity. Ariadne, 2005.
[Gaertner et al., 2003] Thomas Gaertner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, pages 129–143. Springer-Verlag, August 2003.
[Goderis et al., 2005] Antoon Goderis, Ulrike Sattler, Phillip Lord, and Carole Goble. Seven bottlenecks to workflow reuse and repurposing. In The Semantic Web - ISWC 2005, pages 323–337, 2005.
[Goderis et al., 2006] Antoon Goderis, Peter Li, and Carole Goble. Workflow discovery: the problem, a case study from e-science and a graph-based solution. In ICWS '06: Proceedings of the IEEE International Conference on Web Services, pages 312–319. IEEE Computer Society, 2006.
[Goderis et al., 2009] Antoon Goderis, Paul Fisher, Andrew Gibson, Franck Tanoh, Katy Wolstencroft, David De Roure, and Carole Goble. Benchmarking workflow discovery: a case study from bioinformatics. Concurr. Comput.: Pract. Exper., (16):2052–2069, 2009.
[Goderis, 2008] Antoon Goderis. Workflow re-use and discovery in bioinformatics. PhD thesis, School of Computer Science, The University of Manchester, 2008.
[Horváth et al., 2004] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167, New York, NY, USA, 2004. ACM.
[Kashima and Koyanagi, 2002] Hisashi Kashima and Teruo Koyanagi. Kernels for semi-structured data. In ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 291–298, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
65 [Oinn et al., 2004] T Oinn, M.J. Addis, J. Ferris, D.J. Mar- vin, M. Senger, T. Carver, M. Greenwood, K Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics work- flows. Bioinformatics, 20(17):3045–3054, June 2004. [Roure et al., 2008] David De Roure, Carole Goble, Jiten Bhagat, Don Cruickshank, Antoon Goderis, Danius Michaelides, and David Newman. myexperiment: Defining the social virtual research environment. In 4th IEEE International Conference on e-Science, pages 182–189. IEEE Press, December 2008. [Roure, 2009] Robert Stevens David De Roure. The de- sign and realisation of the myexperiment virtual re- search environment for social sharing of workflows. 2009. [Taylor et al., 2006] Ian J. Taylor, Ewa Deelman, Den- nis B. Gannon, and Matthew Shields. Workflows for e-Science: Scientific Workflows for Grids. Springer- Verlag New York, Inc., Secaucus, NJ, USA, 2006. [Thom et al., 2007] Lucineia Thom, Cirano Iochpe, and Manfred Reichert. Workflow patterns for business pro- cess modeling. In Proc. of the CAiSE’06 Workshops - 8th Int’l Workshop on Business Process Modeling, Development, and Support (BPMDS’07), page Vol. 1. Trondheim, Norway, 2007. [Van Der Aalst et al., 2003] W. M. P. Van Der Aalst, A. H. M. Ter Hofstede, B. Kiepuszewski, and A. P. Bar- ros. Workflow patterns. Distrib. Parallel Databases, 14(1):5–51, 2003. [White, March 2004] Stephen A. White. Business process trends. In Business Process Trends, March, 2004. 66 Visually summarizing the Evolution of Documents under a Social Tag (Resubmission) Andre´ Gohr Leibniz Institute of Plant Biochemistry, Halle, Germany Myra Spiliopoulou Otto-von-Guericke University, Magdeburg, Germany Alexander Hinneburg Martin-Luther University Halle-Wittenberg, Germany Abstract Tags are intensively used in social platforms to annotate resources: Tagging is a social phe- nomenon, because users do not only annotate to organize their resources but also to associate semantics to resources contributed by third par- ties. This leads often to semantic ambiguities: Popular tags are associated with very disparate meanings, even to the extend that some tags (e.g. ”beautiful” or ”toread”) are irrelevant to the se- mantics of the resources they annotate. We pro- pose a method that learns a topic model for doc- uments under a tag and visualizes the different meanings associated with the tag. Our approach deals with the following prob- lems. First, tag miscellany is a temporal phe- nomenon: tags acquire multiple semantics gradu- ally, as users apply them to disparate documents. Hence, our method must capture and visualize the evolution of the topics in a stream of doc- uments. Second, the meanings associated to a tag must be presented in a human-understandable way; This concerns both the choice of words and the visualization of all meanings. Our method uses AdaptivePLSA, a variation of Probabilistic Latent Semantic Analysis for streams, to learn and adapt topics on a stream of documents an- notated with a specific tag. We propose a vi- sualization technique called Topic Table to vi- sualize document prototypes derived from top- ics and their evolution over time. We show by a case study how our method captures the evolu- tion of tags selected as frequent and ambiguous, and visualizes their semantics in a comprehensi- ble way. Additionally, we show the effectiveness by adding alien resources under a tag. Our ap- proach indeed visualizes hints to the added doc- uments. 
The full article is published in the proceedings of the International Conference on Knowledge Discovery and Information Retrieval, KDIR 2010 and the SciTePress Digital Library. A copy of the PDF-file is available from http://users.informatik.uni-halle.de/ ˜hinnebur/PS_Files/2010KDIR_TT.pdf. 67 One Clustering Process Fits All - A Visually Guided Ensemble Approach Martin Hahmann, Dirk Habich, Maik Thiele, Wolfgang Lehner Dresden University of Technology Database Technology Group 01187, Dresden, Germany dbinfo@mail.inf.tu-dresden.de Abstract Looking back on the past decade of research on clustering algorithms, we witness two ma- jor and apparent trends: 1) The already vast amount of existing clustering algorithms, is con- tinuously broadened and 2) clustering algorithms in general, are becoming more and more adapted to specific application domains with very par- ticular assumptions. As a result, algorithms have grown complicated and/or very scenario- dependent, which made clustering a hardly ac- cessible domain for non-expert users. This is an especially critical development, since, due to increasing data gathering, the need for analysis techniques like clustering emerges in many ap- plication domains. In this paper, we oppose the current focus on specialization, by proposing our vision of a usable, guided and universally appli- cable clustering process. In detail, we are go- ing to describe our already conducted work and present our future research directions. 1 Introduction To obtain an optimal clustering, knowledge in the domain of clustering algorithms and the domain of the application data is essential. In practise, the average application users, e.g, biologists, are usually experts of the data domain but only have limited knowledge about the available tools for clustering. Therefore, clustering is a challenging task for this domain experts [Jain and Law, 2005]. The reasons can be briefly summarized as follows: (i) The selection of a clustering algorithm is critical, since, in general, only a fraction of the available algorithms is known to the user. Moreover, most algorithms are tailored to specific tasks and are thus, not appropriate for every data set. (ii) Another ob- stacle is parameterization, which offers many degrees of freedom but provides nearly no support. (iii) Finally, the interpretation of results, is complicated by the multitude of existing visualization and validation techniques, which are also not universally applicable. In our opinion, the nearly unlimited options and the high degree specialization of algorithms are the major obstacles, which prevent non-expert users from the suc- cessful application of clustering algorithms. Therefore, we state that users first of all need a universally applica- ble process, rather than a zoo of highly customized clus- tering algorithm. Based on the paradigm of ensemble- clustering, we want to abandon specialization and develop a unifying clustering process. This process integrates the user, offers guidance and allows the purposeful naviga- tion through the available clustering solutions, as well as the step-by-step construction of a satisfying clustering re- sult, by adjustment of the ensemble clustering and on- demand generation of additional clusterings for parts of the dataset. To realize such an unifed clustering process, research effort has to be done on three areas: algorithms (Section 2), usability (Section 3) and architecture. 
In the remainder of this paper we will present our research re- sults we obtained so far [Hahmann et al., 2009; 2010b; 2010a] and state the open challenges in the respective re- search areas (Section 4). We omit the architecture area and refer to the following papers [Habich et al., 2007a; 2007b; 2010]. Finally, we conclude the paper with a brief sum- mary in Section 5 2 Algorithm Area - The Clustering Process In this section, we introduce the underlying algorithmic plaform for our unifed clustering process—Flexible Clus- tering Aggregation (FCA) [Hahmann et al., 2009]. The ba- sic concept of FCA is clustering aggregation, which com- bines different clusterings of a dataset into one result to increase quality and robustness [Hahmann et al., 2009; Gionis et al., 2007]. Different aggregation approaches are known, where the pairwise assignment approach is consid- ered as the most capable one [Boulis and Ostendorf, 2004; Caruana et al., 2006; Topchy et al., 2004; Zeng et al., 2002; Dimitriadou et al., 2001; Dudoit and Fridlyand, 2003; Fred, 2001; Fred and Jain, 2003; Frossyniotis et al., 2002; Gionis et al., 2007; Habich et al., 2006]. This approach evaluates each object pair of a dataset, determining whether it is assigned (i) to the same cluster or (ii) to different clus- ters. The aggregate is constructed by selecting the most frequent of these two pairwise assignments for each object pair and setting it in the result clustering. All existing ag- gregation techniques lack controllability, thus an aggrega- tion result can only be adjusted through modification and re-computation of the input clusterings. Our Flexible Clustering Aggregation (FCA) [Hahmann et al., 2009] tackles this issue. The key approach of our technique is to change the aggregation input from hard to soft clusterings [Bezdek, 1981]. These assign to each ob- ject its relative degree of similarity with all clusters in- stead of a hard assignment to just one cluster. Such as- signments can be (i)generated by specific algorithms like fuzzy c-means [Bezdek, 1981] or (ii) calculated from ar- bitrary clustering results, using refinement techniques like a-posteriori [Zeng et al., 2002]. Up to now, we only utilize fuzzy c-means to generate the clusterings for our ensembles. In a soft clustering result, each datapoint xi|(1 ≤ i ≤ n) 69 of a datasetD is assigned to all k clusters cj |(1 ≤ j ≤ k) of a clustering C to a certain degree. Thus, the assignment in- formation of xi in C is denoted as a vector vi with the com- ponents vip(1 ≤ p ≤ k)|0 < vip < 1and ∑k p=1 vip = 1 describing the relation between xi and the p-th cluster of C. This fine-grained information allows, e.g., the iden- tification of undecidable cluster assignments given when objects have identical maximal similarities with multiple clusters. Assume a clustering with k = 3, and an object xi with vi> = (0.4, 0.4, 0.2). Using this assignment, we cannot decide whether xi belongs to c1 or c2, although c3 can be excluded. Based on this, it is easy to see that the worst case regarding decidability is given for assignments with ∀vip(1 ≤ p ≤ k) = 1/k, since they do not even allow the exclusion of clusters when it comes to clear cluster af- filiations. Of these two kinds of undecidable assignments, we name the first balanced and the second fully balanced [Hahmann et al., 2009]. To incorporate this additional information, we expanded the pairwise assignment cases for the aggregation by adding an undecidable case that is valid for object pairs containing undecidable assignments. 
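The notion of balanced and fully balanced assignments can be illustrated with a small sketch. The following Python snippet classifies a soft assignment vector and derives the pairwise assignment case used during aggregation. The tie tolerance and the function names are assumptions made for this sketch only; the actual FCA implementation relies on the significance measure introduced next rather than on a hard tie test.

```python
import numpy as np

def assignment_state(v, tol=1e-9):
    """Classify a soft cluster assignment vector v (components sum to 1).
    'fully balanced' : all components equal 1/k, so no cluster can be excluded
    'balanced'       : several clusters tie for the maximal degree (undecidable)
    'decided'        : a unique cluster has the maximal degree of similarity"""
    v = np.asarray(v, dtype=float)
    k = len(v)
    if np.allclose(v, 1.0 / k, atol=tol):
        return "fully balanced"
    ties = np.isclose(v, v.max(), atol=tol).sum()
    return "balanced" if ties > 1 else "decided"

def pairwise_case(v_a, v_b):
    """Pairwise assignment case used during aggregation: 'same', 'different',
    or 'undecidable' if either object has an undecidable (balanced) assignment."""
    if assignment_state(v_a) != "decided" or assignment_state(v_b) != "decided":
        return "undecidable"
    return "same" if int(np.argmax(v_a)) == int(np.argmax(v_b)) else "different"

print(assignment_state([0.4, 0.4, 0.2]))                 # balanced: c3 excluded, c1/c2 tie
print(assignment_state([1 / 3, 1 / 3, 1 / 3]))           # fully balanced
print(pairwise_case([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))   # same
```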
Furthermore, we de- rived a significance measure for pairwise assignments on that basis. This measure incorporates the intra-pair simi- larity of soft assignments and their decidability. The lower bound for decidability is defined as 0 or as an impossible decision and is given for the mentioned undecidable clus- ter assignments. The upper bound of 1 is given for objects with a single degree of similarity vip approaching 1 while all others approach 0. Basically, decidability shows the dis- tance of vi to the fully balanced assignment. With this significance score, pairwise assignments are filtered and classified as undecidable if they do not exceed a certain significance threshold. Aggregation control or re- sult adjustment, respectively, is exercised by this filtering and the handling of undecidable pairwise assignments dur- ing aggregation. Since undecidable is no valid option for a final object assignment, two handling strategies exist: one assumes that undecidable pairs are part of the same clus- ter, while the other assumes the opposite. These strategies and the filtering threshold act as parameters, allowing the merging or splitting of clusters without modifying the input clusterings[Hahmann et al., 2009]. Generally, the relation between parameters and the clus- tering result is one of cause and effect. Parameters like k for k-means or ε for DBSCAN cause different effects in the clustering result, e.g., the fusion of clusters or changes in their size. To achieve a certain effect, it is crucial to know its associated cause, which is quite challenging. The FCA method overcomes this by allowing the direct specifi- cation of desired effects, namely: merge for fewer clusters or split for more clusters. In our original work [Hahmann et al., 2009], these effects could only be applied to the whole clustering and were thus mutually exclusive. In our recent work, we enhanced the algorithm so that those effects can be applied to individual clusters. 3 Usablity Area - Visual Decision Support Until now, merging and splitting have been mutually ex- clusive and had to be set for the whole clustering. This is suffcient if the bulk of clusters requires the same operation, but it effectively prevents an individual handling of clus- ters. In tight coupling with our ensemble approach, our ef- forts in the useability area shall enable users to interpret the obtained clustering result and assist them in the decision on Figure 1: Example aggregate. whether or not clusters are stable and should be merged or spilt. With this, the result quality can be iteratively refined, whereas the provided support keeps the iteration count low. To efficiently support result interpretation and adjust- ments, we developed a visualization concept that is tightly coupled to our aggregation method [Hahmann et al., 2010b]. In general, two major groups of data/clustering visualizations can be distinguished: The first one is data- driven and tries to depict all objects and dimensions of a dataset, which leads to incomprehensible presentations for datasets exceeding a certain scale. The second one is result- driven and thus relatively scale-invariant. For example, a clustering can be depicted as a bar chart showing relative cluster sizes and/or additional values like mean or standard deviation. While the first group often shows too much in- formation, the second one often shows not enough. 
So, we positioned our approach as a hybrid between those groups, by visualizing the result and the relations between data and result, which are already incorporated in the soft in- put of our aggregation. In compliance with Shneiderman’s mantra, ‘overview first, zoom and filter, then details-on- demand’ [Shneiderman, 1996], our visualization features three interactive views: overview, cluster composition and relations (c&r), and the attribute view. With this, we want to enable the user to determine the clusters that need no adjustment and to decide which ones should be merged or split with our aggregation algorithm. We define stable clusters according to the general objective of clustering, that asks for clusters with high internal simi- larity that are well separated from each other. Clusters that not fullfill these criteria are candidates for adjustment. To illustrate our approach, the clustering aggregate depicted in Figure 1 is used as our paper example. It has been gener- ated using our ensemble clustering and shows a partition- ing result that needs some adjustments. In all subsequent figures, clusters are identified via color. 3.1 Overview The overview is the first view presented to the user and depicted in Figure 2. This view is completely result- driven, i.e., only characteristics of the clustering aggre- gate are shown. The dominant circle represents the clus- ters of the aggregate, whereas each circle segment corre- sponds to a cluster whose percental size correlates with the segment’s size. The radar-like gauge located on the left shows the distances between the prototypes (centroids) of all clusters. The mapping between centroids in the radar and circle segment is done via color. The radar shows a distance graph, where vertices represent centroids, and edges—invisible in our visualization—represent the Eu- 70 clidean distance between centroids in the full dimensional data space. Therefore, the radar is applicable for high- dimensional data. Since all our views are basically result- driven, we can also handle high-volume datasets without problems. The overview provides the user with a visual summary of the clustering result, allowing a first evaluation of the number of clusters and relations between clusters ex- pressed by distance and size. 3.2 Cluster Composition and Relations If the user identifies clusters of interest in the overview, e.g., two very close clusters like the pink (F) and red (G) ones in Figure 1, they can be selected individually to get more information about them, thus performing ‘zoom and filter’. Cluster selection is done by rotation of the main cir- cle. As soon as a cluster is selected, the composition and relations (c&r) view depicted in Figure 3 (for cluster F) is displayed. The selected cluster’s composition is shown by the row of histograms on the right. All histograms feature the interval [0, 1] with ten bins of equal width. From the left to the right, they show the distribution of: (i) fuzzy assignment values, (ii) significance scores for all object- centroid pairs, and (iii) significance scores for all object- object pairs in the selected cluster. For details concern- ing these scores, refer to [Hahmann et al., 2009]. Certain histogram signatures indicate certain cluster states, e.g., a stable and compact cluster is given if all three histograms show a unimodal distribution with the mode—ideally con- taining all objects—situated in the right-most (highest sig- nificance) bin. Let us regard the signature of the example depicted in Figure 3. 
The histograms show that many of the object- centroid and pairwise assignments are not very strong. This indicates that there are other clusters (G in the example) that strongly influence the selected cluster objects, which leaves the chance that these clusters could be merged. To support such assumptions, the relations between clusters have to be analyzed. For this, the two ’pie-chart’ gauges and arcs inside the main circle are used. The smaller gauge shows the degree of ’self-assignment’ of the selected clus- ter, while the other one displays the degree of ’shared as- signment’ and its distribution among the remaining clus- ters. These degrees are calculated as follows: each fuzzy object assignment is a vector with a sum of 1, consisting of components ranged between 0 and 1, indicating the rel- ative degree of assignment to a certain cluster, i.e., each vector-dimension corresponds to a cluster. The degree of self-assignment is calculated by summing up all compo- nents in the dimension corresponding to the selected clus- ter. This sum is then normalized and multiplied with 100 to get a percental score. The shared assignment is generated in the same fashion for each remaining cluster/dimension. The target and strength of relations between the selected cluster and others is described by the color and size of the shared-assignment slices. For easy identification, the dis- played arcs show these cluster-to-cluster relations by con- necting clusters, where the stroke width shows the strength of the relation. If a cluster is not influenced by others, it shows a very high degree of self-assignment with no outstanding rela- tions to other clusters. In contrast, the example in Figure 3 shows that the selected cluster has a noticeable relation to the red cluster. This supports the merge assumption and furthermore indicates which other cluster should be part of a possible merge. To get additional information, the inter- cluster distances can be analyzed. For this, the user can employ the ’radar’, showing that both clusters in our ex- ample are relatively close to each other (the selected clus- ter is encircled), or switch on additional distance indicators (‘details-on-demand’), as shown in Figure 4. These display the ratio of centroid-to-centroid distances—like the radar— and minimum object-to-object distances between the se- lected and the remaining clusters. If this ratio approaches 1, the respective clusters are well separated and the col- ored bars are distant. In our example, this is the case for all clusters except for the red one, where both bars nearly touch each other, showing that the minimal object distance between the clusters is much smaller than the centroid dis- tance. With this, the user can now savely state that the pink and the red cluster should be merged. To double-check, the red cluster can be selected and should show similar rela- tions to the pink one. With the c&r view, it is also possible to evaluate whether or not a cluster should be split. Candidates for a split show the following: In all three histograms, the mode of the distribution is located in one of the medium-significance bins. Additionally, they feature a reduced degree of self- assignment, but in contrast to the merge case, they have equally strong relations to the remaining clusters and are well separated in terms of the radar and distance indicators. Unfortunately, these characteristics are no clear indication for a split, e.g., non-spherical clusters can exhibit the same properties. 
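As a small illustration of the composition and relations view, the following sketch computes the percental self-assignment and shared-assignment degrees described in this subsection from the fuzzy assignment vectors of a selected cluster's objects. Restricting the computation to the objects of the selected cluster, normalizing over their total assignment mass, and the toy matrix are assumptions of this sketch, not necessarily the exact computation used in AUGUR.

```python
import numpy as np

def assignment_shares(V, selected):
    """Percental self-assignment of the selected cluster and shared assignment
    with every other cluster, from an (n_objects x k) matrix V of fuzzy
    assignments of the selected cluster's objects (each row sums to 1)."""
    V = np.asarray(V, dtype=float)
    totals = V.sum(axis=0)                  # summed assignment mass per cluster
    shares = 100.0 * totals / totals.sum()  # normalize and convert to percent
    shared = {j: shares[j] for j in range(V.shape[1]) if j != selected}
    return shares[selected], shared

# Toy example: objects of the pink cluster (index 0) also lean towards cluster 1 (red).
V_pink = [[0.55, 0.40, 0.05],
          [0.60, 0.35, 0.05],
          [0.50, 0.45, 0.05]]
self_share, shared = assignment_shares(V_pink, selected=0)
print(round(self_share, 1), {j: round(s, 1) for j, s in shared.items()})
# ~55% self-assignment and ~40% shared with cluster 1 would suggest a merge candidate.
```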
To gain more certainty in decisions for split candidates, the attribute view has been developed. 3.3 Attribute View When we look at attributes in terms of clustering, we can state the following: If an attribute has a uniform or uni- modal distribution (in the following Φ), it is not useful for clustering because the objects of the dataset cannot be clearly separated in this dimension. In contrast, bi- or multi-modal distributions are desired, since they can be used for object separation. When we look at attributes on the cluster level, this is inverted. Regarding a cluster, it is desirable that all of its attributes have unimodal distribu- tions, since this shows high intra-cluster homogeneity. A multimodal-distributed attribute would imply that the clus- ter could be further separated in this dimension. Generally, we desire the following: On the dataset level, attributes should be dissimilar to Φ, while on the cluster level, they should resemble it as closely as possible. These are the basics for our attribute view. To calculate the similarity to Φ, we use a straightforward approach. We generate histograms, on the dataset and clus- ter level, for each attribute. From the histogram bins, those that are local maxima are selected. From each maximum, we iterate over the neighboring bins. If a neigboring bin contains a smaller or equal number of objects, it is counted and the next bin is examined; otherwise, the examination stops. With this, we can determine the maximum number of objects and bins of this attribute that can be fitted under Φ. This is the value we display in the attribute view. In Figure 5, the attribute view is depicted for the violet clus- ter E from our example. There are two hemispheres and a band of numbers between them. The band shows the at- tributes of the dataset, ordered by our computed values, and is used to select an attribute for examination (selection has a darker color). The small hemisphere on the right shows the global behavior of attributes. Each curve represents an attribute, while for the selected attribute, the area under its 71 Figure 2: AUGUR overview showing clusters and inter-cluster distances. Figure 3: AUGUR c&r view showing composition and relations for the pink cluster. Figure 4: AUGUR c&r view with activated distance indicators. 72 Figure 5: AUGUR attribute view indicating a split for the violet cluster. curve is colored. The hemisphere itself consists of two 90- degree scales, the upper for the percentage of objects and the lower for the percentage of bins that can be fitted under Φ. The start and end point of each curve show the values for the attribute on these scales. If all objects and bins fit under Φ, a vertical line is drawn and there is no color in the hemisphere. All this also applies to the left hemisphere showing the attribute in the selected cluster. For our exam- ple in Figure 5, we selected attribute 1. We can see a large colored area, showing that more than 50% of the objects and bins do not fit under Φ. If, in ad- dition, the selected cluster shows split characteristics in the c&r view, the user may assume that this cluster should be split. The benefit of this view lies in the fast and easy in- terpretability. More color in the left hemisphere indicates a higher split possibility, while the amount of color in the right hemisphere acts as a measure of confidence for the left. In terms of Shneiderman’s mantra, this view can either be considered as ‘details-on-demand’ or as an ‘overview’ and ‘zoom and filter’ for the attribute space. 
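The bin-walking procedure just described can be summarized in a short sketch. The following Python function takes the histogram of one attribute and returns the largest fraction of objects and of bins that can be fitted under a single mode Φ. Interpreting "smaller or equal" as a comparison with the previously examined bin (i.e., a monotone descent away from the local maximum) is an assumption of this sketch, and the example histogram is invented.

```python
import numpy as np

def fit_under_phi(counts):
    """For a histogram (list of bin counts), find the largest fraction of objects
    and of bins that fit under one unimodal shape: starting at a local maximum,
    walk outwards while the neighboring bins are smaller or equal."""
    counts = np.asarray(counts, dtype=float)
    n_bins, n_objects = len(counts), counts.sum()
    best_obj, best_bins = 0.0, 0
    for i in range(n_bins):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < n_bins - 1 else -1
        if counts[i] < left or counts[i] < right:
            continue                      # not a local maximum
        obj, bins = counts[i], 1
        for step in (-1, +1):             # walk to both sides of the maximum
            j, prev = i + step, counts[i]
            while 0 <= j < n_bins and counts[j] <= prev:
                obj, bins, prev = obj + counts[j], bins + 1, counts[j]
                j += step
        if obj > best_obj:
            best_obj, best_bins = obj, bins
    return best_obj / n_objects, best_bins / n_bins

# A clearly bimodal attribute: only about 60% of the mass fits under one mode,
# which on the cluster level would hint at a possible split.
print(fit_under_phi([1, 6, 9, 4, 1, 2, 8, 10, 5, 2]))
```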
3.4 Feedback The basic idea of our unifed clustering process is to take advantage of the tight coupling between our two compo- nents and change the focus for parameterization from the whole clustering to individual clusters. For this, we use the following workfow: At first, the visualization of an initial clustering aggregate like our running example is presented to the user. In this view, he/she evaluates all clusters and looks for those that need adjustment. For those, the ef- fects merge or split would be identified as appropriate. The respective parameter is then passed to the aggregation al- gorithm and only the specified clusters are subjected to a new aggregation cycle, while the rest of the result is kept. In sucessive steps, the user adjusts the clustering using the provided feedback operations until the result is satisfying. 4 Future Work Although, we already acquired several results and compo- nents for our unifying process, much work still needs to be done. In the algorithmic area the utilization of the ad- ditional information, stored in soft clusterings, has proven beneficial. Therefore we want to expand its employment in our unifying process. Our short term goal is the de- velopment of a soft density-based clustering algorithm and a soft hierarchical method later on. In combination with the soft partitioning algorithm we used so far (fuzzy c- means[Bezdek, 1981]), these three algorithms, will provide us with a good coverage, for the generation of our clus- ter ensemble. To keep parameterization easy, the aggrega- tions role as an abstraction layer must be developed fur- ther, so that users are provided with a stable and algorithm- independent interface [Hahmann et al., 2010b] for the ad- justment of clusterings. For this it is necessary to specifi- cally implement parameters like merge for each algorithm. Besides parameterization, the whole area of usability will be developed further. The employment of density- based and hierarchical clustering algorithms, leads to dif- ferent views of the data. Depending on the generating al- gorithm, different information can be derived from the ob- tained clusterings. This information must be examined, re- fined and communicated to the user in the form of novel vi- sualization concepts. Thereby, the focus lies on convenient presentation metaphors, that present information, neces- sary for the user to navigate through our process without flooding him/her with too much input. In addition, to im- prove the guidance during navigation, we will implement a semi-automatic recommender system for the selection of appropriate parameters/feedback in arbitrary stages of our unifying process. Another major aspect of our future work in the area of usability will be the integration of novel and intuitive interaction platforms like, e.g., Apple’s iPad or Nintendo’s Nunchuk controllers. These platforms have al- ready shown in practice, that intuitive interaction simplifies access to unfamiliar technologies. Concerning the architecture domain, we further focus on scalability for all our used components. Regarding the in- teraction platform previously mentioned, we will also re- gard the question, if the components of our process can be distributed and where each component should be executed. An example setting could execute the complete algorithmic stack on high performance hardware, while process inter- action is done on a portable device. 
In summary, our long term goal, is to combine our exist- ing components with the future work, outlined in this sec- tion to form our unifying clustering process. In its final state, this process will provide an initial clustering as start- ing point, which the user adjusts and refines in a step-by- step fashion. While the user navigates through the process, he/she is offered guidance for parameter and algorithm se- lection, as well as hints on which parts of the clustering still need refinement. 73 5 Conclusion In our work we oppose the specialisation of clustering and claim, that one clustering process can fit all application scenarios. To realize this claim, we concentrate on con- trollable ensemble-clustering, guided user interaction and emerging computation architectures. We are fully aware, that our position is bold and controversial. Nevertheless, our recent work supports our overall vision of a unifed vi- sually guided clustering process. In this paper, we summarized our current status on the algorithmic and useability area. We focus on enabling the user to evaluate an ensemble clustering result and on pro- viding decision support for result refinement with our ex- tended aggregation algorithm proposed in [Hahmann et al., 2009]. There already exist a multitude of cluster visu- alization techniques [Hinneburg, 2009], which mostly try to visualize all objects of the dataset and are thus lim- ited if data sets exceed a certain size. Furthermore, some of these techniques use complex visual concepts, which can hinder interpretation. In contrast, our visualization is tightly coupled to our aggregation method [Hahmann et al., 2009]. We do not try to visualize all objects of the data set but concentrate on the presentation of clusters as well as cluster-cluster and cluster-object relations, derived from soft cluster assignments. This result- and relation-oriented approach allows the interpretation of data sets with arbi- trary volume/dimensionality and supports the user in mak- ing decisions concerning result refinement via the men- tioned split and merge actions. In addition, focusing on ‘what’ to visualize, namely clusters and relations, allows the use of well-known and simple visual elements, e.g., pie charts and histograms, when it comes to ‘how’ to visualize. References [Bezdek, 1981] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981. [Boulis and Ostendorf, 2004] Constantinos Boulis and Mari Ostendorf. Combining multiple clustering systems. In Proc. of PKDD, 2004. [Caruana et al., 2006] Rich Caruana, Mohamed Farid El- hawary, Nam Nguyen, and Casey Smith. Meta cluster- ing. In Proc. of ICDM, 2006. [Dimitriadou et al., 2001] Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik. Voting-merging: An en- semble method for clustering. In Proc. of ICANN, 2001. [Dudoit and Fridlyand, 2003] Sandrine Dudoit and Jane Fridlyand. Bagging to improve the accuracy of a clus- tering procedure. Bioinformatics, 19(9), Jun 2003. [Fred and Jain, 2003] Ana L. N. Fred and Anil K. Jain. Robust data clustering. In Proc. of CVPR, 2003. [Fred, 2001] Ana L. N. Fred. Finding consistent clusters in data partitions. In Proc. of MCS, 2001. [Frossyniotis et al., 2002] Dimitrios S. Frossyniotis, Mi- nas Pertselakis, and Andreas Stafylopatis. A multi- clustering fusion algorithm. In Proc. of SETN, 2002. [Gionis et al., 2007] Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas. Clustering aggregation. TKDD, 1(1), 2007. 
[Habich et al., 2006] Dirk Habich, Thomas Wa¨chter, Wolfgang Lehner, and Christian Pilarsky. Two-phase clustering strategy for gene expression data sets. In Proc. of SAC, 2006. [Habich et al., 2007a] Dirk Habich, Steffen Preissler, Wolfgang Lehner, Sebastian Richly, Uwe Aßmann, Mike Grasselt, and Albert Maier. Data-grey-boxweb services in data-centric environments. In Proceedings of the 2007 IEEE International Conference on Web Ser- vices (ICWS 2007, July 9-13, Salt Lake City, Utah, USA), pages 976–983, 2007. [Habich et al., 2007b] Dirk Habich, Sebastian Richly, Mike Grasselt, Steffen Preißler, Wolfgang Lehner, and Albert Maier. BpelDT - data-aware extension of bpel to support data-intensive service applications. In Pro- ceedings of the 2nd Workshop of Emerging Web Services Technology in conjunction with the 5th IEEE European Conference on Web Services (WEWST 2007, November 26, Halle/Salle, Germany), 2007. [Habich et al., 2010] Dirk Habich, Wolfgang Lehner, Se- bastian Richly, and Wolfgang Lehner. Using cloud technologies to optimize data-intensive service applica- tions. In Proceedings of the 3rd International Confer- ence on Cloud Computing (CLOUD 2010, Miami, FL, USA, June 5-10), 2010. [Hahmann et al., 2009] Martin Hahmann, Peter Volk, Frank Rosenthal, Dirk Habich, and Wolfgang Lehner. How to control clustering results? flexible clustering ag- gregation. In Advances in Intelligent Data Analysis VIII, pages 59–70, 2009. [Hahmann et al., 2010a] Martin Hahmann, Dirk Habich, and Wolfgang Lehner. V2i: A model process for acces- sible clustering. In Appears as Demo at the 16th ACM SIKDD Conference on Knowledge Discovery and Data Mining (KDD 2010, Washington, DC, USA, July 25-28), 2010. [Hahmann et al., 2010b] Martin Hahmann, Dirk Habich, and Wolfgang Lehner. Visual decision support for en- semble clustering. In Proceedings of the 22nd Interna- tional Conference in Scientific and Statistical Database Management (SSDBM 2010, Heidelberg, Germany, June 30 - July 2), pages 279–287, 2010. [Hinneburg, 2009] Alexander Hinneburg. Visualizing clustering results. In Encyclopedia of Database Systems, pages 3417–3425. 2009. [Jain and Law, 2005] Anil Jain and Martin Law. Data clus- tering: A users dilemma. Pattern Recognition and Ma- chine Intelligence, pages 1–10, 2005. [Shneiderman, 1996] Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualiza- tions. In VL ’96: Proceedings of the 1996 IEEE Sympo- sium on Visual Languages, page 336, Washington, DC, USA, 1996. IEEE Computer Society. [Topchy et al., 2004] Alexander P. Topchy, Behrouz Minaei-Bidgoli, Anil K. Jain, and William F. Punch. Adaptive clustering ensembles. In Proc. of ICPR, 2004. [Zeng et al., 2002] Yujing Zeng, Jianshan Tang, Javier Garcia-Frias, and Guang R. Gao. An adaptive meta- clustering approach: Combining the information from different clustering results. In Proc. of CSB, 2002. 74 Efficient frequent connected subgraph mining in graphs of bounded tree-width∗ Tama´s Horva´th1,2 Jan Ramon3 1Dept. of Computer Science III, University of Bonn, Germany 2Fraunhofer IAIS, D-53754 Sankt Augustin, Germany 3Department of Computer Science, Katholieke Universiteit Leuven, Belgium tamas.horvath@iais.fraunhofer.de jan.ramon@cs.kuleuven.be Abstract The frequent connected subgraph mining prob- lem, i.e., the problem of listing all connected graphs that are subgraph isomorphic to at least a certain number of transaction graphs of a database, cannot be solved in output polynomial time in the general case. 
If, however, the transac- tion graphs are restricted to forests then the prob- lem becomes tractable. In this paper we gen- eralize the positive result on forests to graphs of bounded tree-width. In particular, we show that for this class of transaction graphs, frequent connected subgraphs can be listed in incremen- tal polynomial time. Since subgraph isomor- phism remains NP-complete for bounded tree- width graphs, the positive complexity result of this paper shows that efficient frequent pattern mining is possible even for computationally hard pattern matching operators. ∗A long version of this extended abstract appeared in Theoret- ical Computer Science 411(31-33), 2784–2797 , 2010. 75 On the Need of Graph Support for Developer Identification in Software Repositories [position paper] Aftab Iqbal and Marcel Karnstedt Digital Enterprise Research Institute (DERI) National University of Ireland, Galway, Ireland firstname.lastname@deri.org Abstract Software repositories from open-source projects provide a rich source of information for a wide range of tasks. However, one issue to overcome in order to make this information useful is the ac- curate identification of developers. This is a par- ticular challenge, as developers usually use dif- ferent IDs in different repositories of one project, but usually there is no kind of dictionary or sim- ilar available to map the different IDs to real- world persons. Often, they even use different IDs in the same repository. We show that the few methods suggested so far are not always ap- propriate to overcome this problem. Further, we highlight related techniques from other areas and discuss how they can be applied in this context. We particularly focus on the idea of applying graph-based methods and argue for the benefits we expect from that. 1 Motivation In Software Engineering, many tools with underlying repositories have been introduced to support the collabo- ration of distributed software development. Research has shown that these software repositories contain rich amount of information about software projects. By mining the in- formation contained in these software repositories, practi- tioners can depend less on their experience and more on the historical data [Hassan, 2008]. However, software reposi- tories are commonly used only as record-keeping repos- itories and rarely for design decision processes [Diehl et al., 2009]. The Mining Software Repositories (MSR) field analyzes the rich information available in these software repositories to discover interesting facts about the software projects [Diehl et al., 2009]. Examples of software reposi- tories are [Hassan et al., 2005] : 1. source control repositories store changes to the source code as development progresses; 2. bug repositories keep track of the software defects; 3. archived communications between project develop- ers record rationale for decisions throughout the life of a project. Software developers use these repositories to interact with each other or to solve software-related problems. For example, source-code and bugs are quite often discussed on bug tracking systems or project mailing lists. By extract- ing rich information from these repositories, one can guide decision processes in modern software development. For example, data in a source control repository could be ana- lyzed to extract the authorship information, which could be linked to additional authorship information extracted from author tags of source-code files. 
This could allow to keep track of which source files were committed by a developer in different periods of time. With “Linked Data Driven Software Development” (LD2SD) [Iqbal et al., 2009], we have introduced a Linked Data-based methodology to re- late data across software repositories explicitly and unam- biguously. The so created interlinked data sets can be used for querying and browsing the related information that ex- ists in these software repositories. An excerpt of an exemplary RDF representation of Java source-code using our LD2SD approach is shown in list- ing 1. Further, an example of RDF representation of a SVN Commit is shown in listing 2. 1 @prefix baetle: . 2 @prefix ld2sd: . 3 @prefix : . 4 :connect a baetle:JavaClass; 5 baetle:author ; 6 ld2sd:imports "java.io.IOException" , 7 "javax.servlet.ServletException" ; 8 ld2sd:hasMethod :connect#getConnection . Listing 1: An exemplary Java RDFication. 1 @prefix baetle: . 2 @prefix owl: . 3 @prefix : . 4 :275 a baetle:Committing ; 5 baetle:modified ; 6 baetle:author . 7 a 8 baetle:JavaSource ; 9 owl:sameAs . Listing 2: An exemplary Subversion RDFication. The listings indicate one major problem in the context of mining these repositories: often developers use different identities for each software repository and sometimes mul- tiple identities for the same repository, while interacting with these software repositories in different context. Using distinct identities for different repositories makes develop- ers appear as different entities. Hence we need methods and techniques to correctly link the different identities of software developers. This is a main requirement for, among others, being able to keep track of the developers’ activities in different software repositories. There exist only few works discussing this particular is- sue (see Section 3). However, these works either lack in 77 details or propose simplified solutions. In a series of ex- periments, we found that the so far proposed methods are not sufficient for all cases. We reflect on these results in Section 4. Thus, we argue that we need more sophisti- cated identification methods. Specifically, we propose to use graph-based mechanisms. This is based on the ob- servation that all the repository data can be represented as graphs. Moreover, these graphs are related and simi- lar to each other, as they are based on related and simi- lar activities from the same set of developers. Finally, we conclude and propose to use these graph-based and related approaches in Section 5. 2 Identities in Software Repositories In order to interact with the many software repositories that are part of an open-source project, developers usually re- quire to adopt an identity for each repository. Often, devel- opers use multiple identities for the same repository [Rob- les and Gonzalez-Barahona, 2005]. Different types of iden- tities that developers use in software repositories discussed by Robles et al. are (also summarized in Table 1): 1. In a source-code file, developers appear with many different identities, such as real life names, email ad- dresses, SVN identifiers and sometimes the combina- tion of real life name and email addresses; 2. Developers usually use multiple email addresses to send mails to the project mailing list. Sometimes the email headers contain the pair, which helps to link the email address to the developer; 3. To commit source-code on the source control reposi- tory, developers use a separate account on the version- ing system; 4. 
Bug tracking systems require an account associated with an email address. Data Source Identities Source Code Name Surname Source Code username@domain.com Source Code Name[username@domain.com] Source Code $subversionID Mailing List username@domain.com Mailing List Name Surname Versioning System $subversionID Bug Tracking username@domain.com Table 1: Identities found in different software reposito- ries [Robles and Gonzalez-Barahona, 2005]. In open-source projects, project outsiders (i.e., contrib- utors) also submit source-code patches to the project mail- ing list. They cannot contribute code directly to the source control repository, because they are not invited to be part of the core group of developers of that project [Hassan, 2008]. Such contributors quite often mention their names in the javadoc1 of the source-code files while modifying them to fix a particular bug or adding a new feature request to the project. When core developers finally commit the changes of contributors to the source-control repository, they very often mention the name of the contributor who provided the patch in the summary of the commit (e.g., “Patch provided by DeveloperABC ...”). Such information is very useful for analysis if extracted properly and interlinked to the correct developer name. However, the fact that the sets of users differ between the repositories increases the difficulties of mapping IDs between them. This is a further challenge beside the fact 1http://java.sun.com/j2se/javadoc/ writingdoccomments/ that we have no 1:1 mapping, neither in one repository nor over different repositories. Actually, we often encounter an n:m mapping, e.g., several email addresses might belong to the same developer and multiple developers might use the same email address. An example of the later is an address like developers@apache.org. While this might be obvious in this case, it poses a significant challenge for au- tomated approaches for developer identification. 3 Related Work To the best of our knowledge, there are only a few pub- lished works on identifying and relating the different iden- tities that developers use to interact with different tools in the field of software engineering. In [Bird et al., 2006], Bird et al. proposed an approach to produce a list of identifiers by parsing the emails and clus- tering them. The clustering algorithm to measure the sim- ilarity between every pair of IDs is based on string simi- larity between names, between emails, between names and emails, etc. Two IDs with a similarity measure lying be- low a pre-defined threshold value are placed into the same cluster. The authors use different approaches to compute the similarity measures between every pair of IDs, some of which we tested on our dataset to validate the effective- ness of these approaches (cf. Section 4.1 and Section 4.2). We found that their approach failed to provide satisfying results in our case. We use this as a motivation to argue for sophisticated approaches, such as graph-based methods. [Robles and Gonzalez-Barahona, 2005] discusses the problem of developer identification in general, but the work lacks in details about the heuristics they propose to iden- tify and match the different identities of developers. This makes it difficult to validate their approaches for solving this problem. The authors propose a technique to build one identity from another by extracting the “real life” name from email addresses, such as nsurname@domain.com, name.surname@domain.com etc. 
This is an approach based on pre-defined name schemes, a variant that we also evaluate for our case in Section 4.3. Similarly, this method did not provide satisfying results for the data set we investi- gated. Further, they try to match user names obtained from CVS to email addresses (excluding the domain after “@”). This approach relies on string similarity again. In general, the problem is related to duplicate detec- tion. While research in this area mostly refers to identi- fying duplicates in the same data set, the techniques might be mapped to the case of matching over different data sets. However, they are tailor-made for identifying different IDs of the same developer inside one repository. [Naumann and Herschel, 2010] provides a nice overview of this re- search direction. Interestingly, the authors also reflect on graph-based duplicate detection. The general idea is to use structural measures of graphs in order to identify different nodes that are similar to each other, wrt. these structural features. Two very similar nodes are likely referring to the same person. Bringing these approaches to the case of multiple graphs forms a basis for several works on net- work de-anonymisation [Narayanan and Shmatikov, 2009; Wondracek et al., 2010]. The idea is similar: use the struc- tural information from one (partially) known graph and use it to identify similar nodes in another graph, one of which we have only (maybe limited) structural information. Again, those nodes that are very similar are likely refer- ring to the same person. This requires input graphs that are somehow related to each other. Further, the more adversary information we have, the higher accuracy we can achieve. We explore this approach further for our case, where we have several different input graphs, in Section 5. 78 4 Preliminary Results In this section, we present results gained on data from a rep- resentative open-source project: Apache Tomcat2. We exe- cuted some experiments to evaluate the methods described in [Bird et al., 2006] and [Robles and Gonzalez-Barahona, 2005]. In the following, we refer to email addresses by using only the term email. We gathered data by parsing the emails from Apache Tomcat mailing lists starting from 2005 to date. For every email containing a written name and the according email, we extracted the from and to field from the email header to produce a list of pairs. We parsed 49,927 emails from the mailing list and produced 1261 distinct pairs. We use these results as a ground proof to validate the different techniques that could be utilized to interlink the different identities of a developer. The reason for using data from a set where we actually know the exact matching is simple: it is the only way to provide some ground truth. It is not possible to ex- tract similar information for other types of IDs. The only, clearly impractical, way would be to ask all involved per- sons for the different IDs they use. However, we strongly believe that the gained results are significant for these other types of IDs as well. Moreover, the methods we tested were proposed for matching email addresses. To check the effectiveness of each approach, we first used the approach to find a matching developer name for each email. 
Then, we computed two versions of precision P and recall R values for each approach using the follow- ing equations: P1 = #emails with at least one correct match #matched emails (1) R1 = #emails with at least one correct match #total emails (2) P2 = #emails with exactly one correct match #matched emails (3) R2 = #emails with exactly one correct match #total emails (4) By determining these different types we gain some in- teresting additional knowledge. For P1 and R1 we count a hit if we found at least one correct match, i.e., even if we found more incorrect matches. This provides an “op- timistic” quality assessment, as the correct match(es) are found. Anyhow, in reality one would still have to filter out the wrong matches. Thus, we determine a “pessimistic” quality assessment in P2 and R2. If there exist correct and wrong matches for a given email this is not regarded as a hit. In order to compute the similarities between developer names and email addresses or between two email addresses (see Section 4.1 and Section 4.2), we used the Levenshtein edit distance [Ukkonen, 1985] algorithm, as suggested in [Bird et al., 2006]. The match(es) for an email are those developer names with the lowest distance. Clearly, there is no sense in allowing arbitray large distances, as this would mean that one string can be transformed into a completely other one. Thus, we tested three different threshold values: 1. the maximal length of both strings 2. the minimal length of both strings 3. a fixed threshold value of 4 2http://tomcat.apache.org/ 4.1 Similarity between Names and Emails In the first test we matched email addresses (excluding the domain after “@”) and developer names by doing a pairwise comparison. Based on the above described ap- proaches, we computed the precision and recall values, which are shown in Table 2. Threshold P1 in % R1 in % Maximum length 67.1 57.1 Minimum length 57.4 46.3 Fix Threshold 54.4 30.3 Threshold P2 in % R2 in % Maximum length 24.3 20.1 Minimum length 18.4 15.3 Fix Threshold 48.4 27.2 Table 2: Names-email similarity results. The results for P1 and R1 are actually quite good, where the maximal-length threshold seems to be the best choice. However, the accuracy is not high enough. Moreover, the values of P2 and R2 show that these methods produce a lot of false matches. As the fixed threshold seems to be the most “stable” in that context, we can learn that high thresholds tend to match very different strings – and thus create more false positives. In general, all methods are not very well suited for practice on a data set like ours. 4.2 Similarity between Emails Besides names and emails, it is very likely that email ad- dresses are textually similar to each other if they belong to the same developer. Thus, in the second test, we computed the pairwise similarities between email addresses (exclud- ing the domain after “@”) and determined match(es) for one email by choosing those with the lowest distance. This was also suggested in [Bird et al., 2006]. The resulting precision and recall values are shown in Table 3. Threshold P1 in % R1 in % Maximum length 3.2 3.2 Minimum length 3.1 3.0 Fix Threshold 7.2 2.4 Threshold P2 in % R2 in % Maximum length 2.0 2.0 Minimum length 1.5 1.5 Fix Threshold 3.5 1.3 Table 3: Email similarity results. Clearly, this idea does not work at all for our data. The low values show that assuming similarity between emails in order to find matches is an absolutely inappropriate so- lution for our case. 
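For reference, the matching and scoring procedure of this section can be sketched as follows: emails are matched to developer names via the Levenshtein distance of the email's local part, and the optimistic and pessimistic precision/recall of Equations (1)-(4) are computed against a ground-truth mapping. The normalization of names and emails (lowercasing, removing dots and spaces), the fixed threshold and the second example address are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_email(email, names, threshold=4):
    """Return the developer name(s) with the smallest edit distance to the
    local part of the email, if that distance does not exceed the threshold."""
    local = email.split("@")[0].lower().replace(".", "")
    scored = [(levenshtein(local, n.lower().replace(" ", "")), n) for n in names]
    best = min(d for d, _ in scored)
    return [n for d, n in scored if d == best] if best <= threshold else []

def precision_recall(matches, truth):
    """Optimistic (P1/R1: at least one correct match) and pessimistic
    (P2/R2: exactly one match and it is correct) scores, as in Eqs. (1)-(4)."""
    matched = [e for e, m in matches.items() if m]
    at_least = [e for e in matched if truth[e] in matches[e]]
    exactly = [e for e in matched if matches[e] == [truth[e]]]
    n_matched = max(len(matched), 1)  # avoid division by zero if nothing matched
    return (len(at_least) / n_matched, len(at_least) / len(truth),
            len(exactly) / n_matched, len(exactly) / len(truth))

truth = {"aftab.iqbal@deri.org": "Aftab Iqbal", "mkarnstedt@deri.org": "Marcel Karnstedt"}
names = list(truth.values())
matches = {e: match_email(e, names) for e in truth}
print(matches, precision_recall(matches, truth))
```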
4.3 Matching based on Name Schemes

More likely, email addresses are built from the "real life" name of a developer. As suggested by [Robles and Gonzalez-Barahona, 2005], in the third test we tried to identify matching developer names for emails (excluding the domain after "@") by checking the relation between them based on different name schemes. For example, aftab.iqbal@deri.org matches Aftab Iqbal based on the name scheme name.surname. Based on our observations on different open-source projects, we selected name schemes that developers quite often use to build their email addresses: name.s (e.g., aftabi), n.surname (e.g., aiqbal), n.s (e.g., ai), na.su (e.g., afiq), name.surname (e.g., aftabiqbal), e.surname (e.g., biqbal) and name (e.g., aftab). Note that the dot in a name scheme is only for illustration purposes; in the actual matching we do not use dots and ignore them in emails if present. Precision and recall as computed for each of the name schemes are shown in Table 4.

Table 4: Name scheme - email results.

  NameScheme     P1 in %   R1 in %
  name.s         83.0      2.0
  n.surname      96.0      9.2
  n.s            33.3      1.0
  na.su          50.0      0.2
  name.surname   99.4      14.0
  e.surname      100.0     0.6
  name           72.0      3.5

  NameScheme     P2 in %   R2 in %
  name.s         65.5      2.0
  n.surname      94.0      9.0
  n.s            5.0       0.1
  na.su          0.0       0.0
  name.surname   99.4      14.0
  e.surname      100.0     0.6
  name           72.0      3.5

To our surprise, most of the schemes result in very high precision values, for both types of quality measures. This means that if we find a match with these techniques, it is very likely a correct match. In contrast, the very low recall values show that this method is capable of identifying only a handful of all matches. Thus, on its own, it is also not suited for our use case.

Summarizing, we can state that none of the three tested approaches proves to be suited for our case. As we have some good results in parts, it seems promising to combine these techniques with more advanced approaches. By inspecting the precision and recall values, we can conclude that combining the methods among themselves is not promising. Thus, in the following section, we discuss graph-based methods to complement them.

5 Graph Support: A New Way of Developer Identification

As briefly mentioned in Section 3, there are several approaches that tackle a problem similar to the one discussed here on the basis of graphs. The main principle is that if we have two related graphs, we can use the structural features of nodes from one graph to reflect on the nodes from the other graph. This is particularly efficient in the context of social network graphs [Narayanan and Shmatikov, 2009; Wondracek et al., 2010], i.e., graphs built from social relations between persons.

The approach we suggest builds on the same principle. Out of the different repositories, we can construct several graphs: developer-to-file relations, developer-to-developer relations (e.g., working on the same files), email-to-email conversations, bug-to-developer relations, etc. While not all of these graphs stem from social interactions, they are clearly all related to each other. Using some base knowledge, for instance gained from the name scheme methods described above, we can try to identify pairs of matching nodes. In order to map SVN IDs to developer names we could for instance build "file fingerprints" (similar to the group fingerprints in [Wondracek et al., 2010]) and compare them. There are plenty of different options.
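The general idea of comparing nodes across two related graphs by structural features can be illustrated with a short sketch. The fingerprint used here (degree plus degree centrality) and the nearest-neighbour assignment are only one plausible, simplified instantiation of the fingerprint idea, chosen for brevity; they are neither the method of [Wondracek et al., 2010] nor the one developed in this paper.

```python
import networkx as nx

def fingerprint(graph):
    """Structural fingerprint per node: (degree, degree centrality)."""
    centrality = nx.degree_centrality(graph)
    return {n: (graph.degree(n), centrality[n]) for n in graph.nodes}

def match_by_structure(graph_a, graph_b):
    """For every node of graph_a, pick the structurally most similar node of
    graph_b (smallest squared distance between fingerprints)."""
    fa, fb = fingerprint(graph_a), fingerprint(graph_b)
    return {a: min(fb, key=lambda b: (da - fb[b][0]) ** 2 + (ca - fb[b][1]) ** 2)
            for a, (da, ca) in fa.items()}

# Toy graphs standing in for the email-to-email and developer-to-developer
# relations; node labels and edges are made up for illustration only.
emails = nx.Graph([("a@x.org", "b@x.org"), ("a@x.org", "c@x.org"), ("b@x.org", "c@x.org")])
developers = nx.Graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")])
print(match_by_structure(emails, developers))
```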
The main task is to identify the kind of relations that con- nect two or more graphs. Unfortunately, we are currently restricted in the graphs we can use. This is mainly due to the fact that we have ground truth only for emails and developer names. The only graphs we can extract from the Apache Tomcat project containing both types of IDs are email-to-email conversa- tions and developer-to-developer relations (based on com- mon files). However, we found that both graphs are sim- ilar in certain structural features, such as centrality values and degree distribution. This is quite intuitive, as develop- ers that handle more classes (central position, higher de- gree) can be assumed to communicate more by email. In the very near future we plan to develop according meth- ods for developer identification based on this observation. Later, we plan to extend these approaches to all types of graphs and IDs we find in the open-source software repos- itories. Moreover, we believe it is promising to extend the raw graphs we find by graphs derivated from other knowl- edge. For example, we will analyse the actual content of email conversations. Using according NLP techniques, we will very likely find more relations between developers and other mail users. With confidence, we could show that existing methods are not suited for all cases. With the experiences from duplicate detection and de-anonymization, we strongly be- lieve to have identified the right way for sophisticated de- veloper identification. We are looking forward to explore this very interesting and relevant field of research in more detail in very near future. Acknowledgements This material is based upon works jointly supported by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and under Grant No. 08/SR- C/I1407 (Clique: Graph & Network Analysis Cluster) References [Bird et al., 2006] Christian Bird, Alex Gourley, Prem De- vanbu, Michael Gertz, and Anand Swaminathan. Min- ing email social networks. In MSR ’06: Int. Workshop on Mining Software Repositories, pages 137–143, 2006. [Diehl et al., 2009] Stephan Diehl, Harald C. Gall, and Ahmed E. Hassan. Guest editor’s introduction: Special issue on mining software repositories. Empirical Softw. Eng., 14(3):257–261, 2009. [Hassan et al., 2005] Ahmed E. Hassan, Audris Mockus, Richard C. Holt, and Philip M. Johnson. Guest editor’s introduction: Special issue on mining software reposito- ries. IEEE Trans. Softw. Eng., 31(6):426–428, 2005. [Hassan, 2008] Ahmed E. Hassan. The Road Ahead for Mining Software Repositories. In Future of Software Maintenance (FoSM) at Int. Conf. on Software Mainte- nance(ICSM), 2008. [Iqbal et al., 2009] A. Iqbal, O. Ureche, M. Hausenblas, and G. Tummarello. LD2SD: Linked Data Driven Soft- ware Development. In Int. Conf. on Software Engineer- ing and Knowledge Engineering (SEKE 09), 2009. [Narayanan and Shmatikov, 2009] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In SP ’09: IEEE Symposium on Security and Privacy, pages 173–187, 2009. [Naumann and Herschel, 2010] Felix Naumann and Melanie Herschel. An introduction to duplicate de- tection. Synthesis Lectures on Data Management, 2(1):1–87, 2010. [Robles and Gonzalez-Barahona, 2005] Gregorio Robles and Jesus M. Gonzalez-Barahona. Developer identifi- cation methods for integrated data from various sources. SIGSOFT Softw. Eng. Notes, 30(4):1–5, 2005. [Ukkonen, 1985] Esko Ukkonen. Algorithms for approx- imate string matching. Inf. Control, 64(1-3):100–118, 1985. 
[Wondracek et al., 2010] Gilbert Wondracek, Thorsten Holz, Engin Kirda, and Christopher Kruegel. A prac- tical attack to de-anonymize social network users. Tech- nical report, 2010. http://www.iseclab.org/ papers/sonda-TR.pdf. 80 Separate-and-conquer Regression Frederik Janssen and Johannes Fu¨rnkranz TU Darmstadt, Knowledge Engineering Group Hochschulstraße 10, D-64289 Darmstadt {janssen,juffi}@ke.tu-darmstadt.de Abstract In this paper a rule learning algorithm for the prediction of a numerical target variable is pre- sented. It is based on the separate-and-conquer strategy and the classification phase is done by a decision list. A new splitpoint generation method is introduced for the efficient handling of numer- ical attributes. It is shown that the algorithm performs comparable to other regression algo- rithms where some of them are based on rules and some are not. Additionally a novel heuris- tic for evaluating the trade-off between consis- tency and generality of regression rules is intro- duced. This heuristic features a parameter to di- rectly trade off the rule’s consistency and its gen- erality. We present an optimal setting for this pa- rameter based on an optimization on several data sets. 1 Introduction The accurate prediction of a numerical target variable is an important task in machine learning. There are several domains that can benefit from regression methods. For ex- ample, in the domain of financial data, it is a crucial issue to predict the volume of a credit. Here, classification al- gorithms can only provide a decision of whether or not a credit should be given but are not capable of predicting its size. In the machine learning community the main task still is to predict a categorical outcome but through the last years the task of regression has gained more and more interest. Regression has its roots in the statistical community from where several algorithms were proposed over the years. The list includes the popular linear regression that is very efficient but still shows a good performance. The main ad- vantage in using means of machine learning lies in the com- prehensibility of the models. For instance, simple IF-THEN rules are directly interpretable by a data miner. Rules and trees are the two variants of interpretable models used in machine learning. As rules are typically more expressive because they are able to overlap, the goal of this work is the design of a rule learning system that on the one hand has a performance that is comparable to state-of-the-art al- gorithms and that on the other hand yields models that are still human-readable. There are several strategies to induce a set of rules. Some of them rely on the gradient-descent algorithm for finding a rule ensemble that optimizes some loss function. Oth- ers convert given trees into sets of rules. However, one of the most popular strategy in classification is the so-called separate-and-conquer paradigm. Due to its simplicity and its good performance in classification1, we decided to use this strategy to design the algorithm. The paper is started with a brief recapitulation of related work. It is continued by a short introduction of separate- and-conquer rule learning for classification. Then the adap- tations that are necessary to extent separate-and-conquer rule learning for classification to regression are specified. Some error measures are introduced and the handling of nu- merical attributes is described. 
Then the experimental setup and the evaluation methods are specified and the method for optimizing the parameters of the algorithm is described. The following section describes the results and the last one concludes the paper. 2 Related work The separate-and-conquer strategy is not used frequently for learning regression rules. Exceptions include predic- tive clustering rules (PCR) [Zˇenko et al., 2005], the FRS system [Demsˇar, 1999], which is a reimplementation of the FORS system [Karalicˇ and Bratko, 1997], and M5RULES [Holmes et al., 1999; Quinlan, 1992; Wang and Witten, 1997] which generates the regression rules from model trees and uses linear models in the head of the rules. Predic- tive clustering rules are generated by modifying the search heuristic of CN2 [Clark and Niblett, 1989]. Instead of ac- curacy or weighted relative accuracy, it uses a heuristic that is based on the dispersion of the data. This algorithm also follows a different route by joining clustering approaches with predictive learning. The R2 system [Torgo, 1995] works to some extent analogously to other separate-and-conquer algorithms by selecting an uncovered region of the input data. But this selection differs from the mechanism used in regular separate-and-conquer learning. However, it also allows for rules to overlap and the rules predict linear models instead of a single target value. Other mechanisms for learning regression rules are mainly based on ensemble techniques as used in the RULE- FIT learning algorithm [Friedman and Popescu, 2008] or in REGENDER [Dembczyn´ski et al., 2008]. The first algo- rithm performs a gradient descent optimization, allows the rules to overlap, and the final prediction is calculated by the sum of all predicted values of the covering rules instead of that of a single rule. The second one uses a forward stage- wise additive modeling. 1The famous RIPPER algorithm [Cohen, 1995], one of the most accurate rule learners for classification is also based on the separate-and-conquer paradigm. 81 Another popular technique to deal with a continuous target attribute is to discretize the numeric values as a preprocessing step and afterwards employ regular ma- chine learning methods for classification. Research fol- lowing this path can be found in [Torgo and Gama, 1996; Weiss and Indurkhya, 1995]. The main problem here is that the number of bags for the discretization process is not known in advance. For this reason the performance of this technique strongly depends on the choice of the number of classes. 3 Separate-and-conquer rule learning and Regression Most inductive rule learning algorithms for classifica- tion employ a separate-and-conquer strategy for learning rules that allow to map the examples to their respective classes. The basic idea of the separate-and-conquer strat- egy [Fu¨rnkranz, 1999] is to cover a part of the example space that is not explained by any rule yet (the conquer step). This region is covered by searching for a rule that fulfills some properties, i.e., has a low error on this parti- tion of the input space. After this rule is found, it is added to a set of rules, and all examples that are covered by the rule are removed from the data set (the separate step). Then, the next rule is searched on the remaining examples. This procedure lasts as long as (positive) examples are left. 
The two constraints that all examples have to be covered (also called completeness) and that no negative example has to be covered in the binary case (consistency) can be relaxed so that examples remain uncovered in the data or negative examples are covered by the set of rules. This relaxation mostly is driven from preventing overfitting. In the end, the algorithm returns a set of subsequently learned rules. For classification of unseen examples, each of the rules in the list is tested whether or not it covers the example. The first rule that covers the example (i.e., matches all the given attribute values) “fires” and predicts the value of the example by using the head of the rule. If no rule in the (decision) list covers the example, the prediction is given by a default rule that usually predicts the majority class in the data. In the following we will have a closer look at the main step of the algorithm2, namely how to navigate through the search space. Most of the algorithms build all possible can- didate rules from the data by using all values for a given attribute and include these attribute-value pairs in a can- didate rule. Thus, an attribute-value pair (a condition) is added to a given candidate rule which results in a refined candidate rule, i.e., a refinement of the former candidate rule. For nominal attributes these values are given from the data itself but for numerical attributes usually all possible splitpoints are used. The splitpoints are calculated as the mean between two adjacent (previously sorted) values. Finally, when all candidates with one condition are gen- erated a heuristic is used to determine the best one. Then, the best candidate rule is stored and refined to yield all re- finements with two conditions. For nominal attributes the used ones are stored (and not used any more) and for nu- merical ones the relations < and ≥ are evaluated. This means that a numerical attribute may occur twice in one rule by using it for a test on < and on ≥. This procedure usually runs as long as negative examples are covered. In 2For pseudo-code of the algorithm see [Janssen and Fu¨rnkranz, 2010b]. this step the algorithm also ensures that a minimum number of examples is covered (a user-given value). For all experi- ments (cf. Section 5 and 6) we fixed the minimum coverage to 3 examples. Note that missing attribute values can never be covered and the attribute with the missing value is ignored. When the class of an instance is missing it is removed from the dataset in a preprocessing step. As search strategy simple hill-climbing was used. There are many different heuristics to navigate the search (for an overview see [Fu¨rnkranz and Flach, 2005]) but all of them are trying to maximize the coverage of positive ex- amples (p) and to minimize the negative coverage (n). To reach this objective different ways are employed but usu- ally, in some way, there has to be a combination of consis- tency (i.e., the error or the negative coverage) of the rule and its generality (i.e., the number of examples that are covered). Most of the heuristics have a fixed trade-off but some of them feature a parameter to adjust it. In previous work the parameters of some of these heuristics were tuned, so that they achieved the most accurate trade-off between consistency and coverage [Janssen and Fu¨rnkranz, 2010a]. In this work we follow the same path by defining such a parametrized heuristic and by tuning its parameter to yield the best fit between these two objectives. 
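Since the paper itself only points to its pseudo-code in [Janssen and Fürnkranz, 2010b], the following is a schematic sketch of the covering loop and the top-down refinement just described, restricted to numerical attributes. The example representation (dicts with a 'target' key), the heuristic interface and the default-rule choice are our own simplifications, not the authors' exact implementation.

```python
from statistics import mean

def covers(rule, x):
    """A rule is a list of conditions (attribute, op, value) with op in {'<', '>='}."""
    return all((x[a] < v) if op == '<' else (x[a] >= v) for a, op, v in rule)

def refine(rule, splitpoints):
    """All refinements of a rule: add one numerical test per attribute/splitpoint."""
    for attr, points in splitpoints.items():
        for v in points:
            for op in ('<', '>='):
                yield rule + [(attr, op, v)]

def learn_rule(examples, splitpoints, heuristic, min_coverage=3):
    """Top-down hill-climbing: keep adding the best condition while it improves."""
    rule, best_score = [], float('-inf')
    while True:
        improved = None
        for cand in refine(rule, splitpoints):
            covered = [x for x in examples if covers(cand, x)]
            if len(covered) < min_coverage:
                continue
            score = heuristic(cand, covered, examples)
            if score > best_score:
                best_score, improved = score, cand
        if improved is None:
            return rule
        rule = improved

def separate_and_conquer(examples, splitpoints, heuristic, left_out=0.1):
    """Covering loop: learn a rule, remove the examples it covers, repeat until
    only the allowed fraction remains uncovered; finish with a default rule."""
    rules, remaining = [], list(examples)
    while len(remaining) > left_out * len(examples):
        rule = learn_rule(remaining, splitpoints, heuristic)
        if not rule:
            break
        # head of the rule: a single target value, e.g. the mean of the covered examples
        prediction = mean(x['target'] for x in remaining if covers(rule, x))
        rules.append((rule, prediction))
        remaining = [x for x in remaining if not covers(rule, x)]
    default = mean(x['target'] for x in remaining) if remaining else 0.0
    return rules, default
```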
3.1 Separate-and-conquer for regression

As noted above, some of the properties that come with categorical binary data do not apply to numerical target variables. Thus the algorithm had to be adapted in several ways. First of all, each evaluation of a single splitpoint requires a scan through the data. For this reason a novel splitpoint method had to be developed that allows using only a subset of all splitpoints, to prevent the algorithm from getting too inefficient (cf. Section 3.3). Mechanisms for the efficient computation of splitpoints known from classification proved to be inefficient in our first experiments.

The heuristics that were introduced for the task of classification were not suitable for regression either. In regression there is no notion of positive or negative examples. Hence, an alternative error measure has to be defined. The default rule also has to be adapted because there is no majority class any more. A simple way to do this is to take the mean over all remaining examples as prediction. Another way would be to take the mean of all examples. We experimented with both settings (cf. Section 6). Finally, the methods for evaluating the final model have to be adapted, because using measures like accuracy (the percentage of correctly predicted examples) is not practicable any more.

3.2 Error measures for regression

There are several ways to compute the error of a rule or of a complete model for regression tasks. This section gives an overview of the measures we used. In the following, m denotes the total number of examples in the (current) dataset, y is the value of the current example, \bar{y} is the value predicted by the rule, and y' is the mean over all examples.

The mean absolute error is the mean of the sum of the absolute errors of all examples that are covered by the rule:

  L_{MAE} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \bar{y}_i|    (1)

The root mean squared error is defined by taking the root of the mean squared error:

  L_{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \bar{y}_i)^2}    (2)

The problem of L_RMSE is that it is domain-dependent. As the amplitude of the values in the domain changes, the amplitude of the error measure changes as well. Thus, the errors are not comparable among different datasets. For using these measures to compute the heuristic value, this may not be a problem, because only candidate rules are compared to each other. But if a combination of the error and the coverage is taken, this becomes crucial due to normalization issues. For the normalization of the L_RMSE, usually the deviation from the mean is used, which is given by

  L_{default} = \sum_{i=1}^{m} (y_i - y')^2    (3)

Thus, the relative root mean squared error (abbreviated rrmse in the remainder of the paper) becomes

  L_{RRMSE} = \frac{L_{RMSE}}{\sqrt{\frac{1}{m} \cdot L_{default}}}    (4)

These measures can be used for evaluating a single candidate rule but also for evaluating a whole theory (an ordered set of rules). Note that the L_RRMSE has its best value at 0 when each example is classified with the correct value. Theoretically, its worst value is 1, which corresponds to always predicting the default value derived from L_default for an unseen example. Due to the split into 10 folds that happens during the cross-validation, the values of the L_RRMSE can become bigger than 1. A stratification of regression values is not possible, and this may result in splits where the test fold contains examples whose values have never appeared in the training fold. In these cases the prediction with the value of L_default would have been superior to the values predicted by the learned model.
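In code, Eqs. (1)-(4) amount to a few lines. The sketch below is a direct transcription for the case of a single rule with a constant prediction y_hat, with y_mean denoting the mean over all examples (y' above).

```python
from math import sqrt

def mae(ys, y_hat):
    """Mean absolute error of a constant prediction y_hat, Eq. (1)."""
    return sum(abs(y - y_hat) for y in ys) / len(ys)

def rmse(ys, y_hat):
    """Root mean squared error, Eq. (2)."""
    return sqrt(sum((y - y_hat) ** 2 for y in ys) / len(ys))

def rrmse(ys, y_hat, y_mean):
    """Relative root mean squared error, Eqs. (3)-(4): RMSE normalized by the
    root mean squared deviation from the global mean y_mean."""
    l_default = sum((y - y_mean) ** 2 for y in ys)   # Eq. (3)
    return rmse(ys, y_hat) / sqrt(l_default / len(ys))
```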
Note that some databases practically even encode randomness, because the L_RRMSE values on those were always bigger than 1, independently of the used algorithm.

To derive the relative coverage, the number of covered examples is divided by the total number of examples:

  relCov = \frac{1}{m} \cdot coverage(Rule)    (5)

We decided to combine the error and the generality of a rule by using the rrmse and the relative coverage:

  h_{cm} = \alpha \cdot (1 - L_{RRMSE}) + (1 - \alpha) \cdot relCov    (6)

Here, the parameter α enables a trade-off between the error and the generality of the rule. For α = 1 the relative coverage is ignored and thus the rules are evaluated solely by inspecting their error. This setting would yield a model that consists only of rules that cover a single example in the data and would thus clearly lead to overfitting (note that this holds only in a scenario where a rule may cover a single example). The other extreme is to set α = 0, which results in ignoring the error of the rule. A model built with this setting would only consist of the default rule, because its coverage is the highest that could be achieved by any rule. The optimal trade-off lies somewhere in between these two extremes.

The heuristic h_cm is an adaptation of a previously introduced heuristic called the relative cost measure [Fürnkranz and Flach, 2005]. Its formula is given by

  h_{cr} = c_r \cdot \frac{p}{P} - (1 - c_r) \cdot \frac{n}{N}    (7)

where p is the positive coverage of the rule, n is the negative coverage of the rule, P is the total number of positives and N is the total number of negatives. It was designed for evaluating classification rules, thus relying on coverage statistics. In previous work [Janssen and Fürnkranz, 2010a] an optimal setting for the parameter of h_cr was found (c_r = 0.342). It encodes a clear preference of the consistency (explained by p/P) over the coverage (denoted by n/N). It achieved good performance among different classification heuristics, as shown in [Janssen and Fürnkranz, 2010a], since it was the second best heuristic of all. Thus the motivation to modify exactly this heuristic was its good performance and that it is best suited to be adapted to regression.

3.3 Splitpoint processing

As noted above, the generation of all possible splitpoints would be too costly. To avoid this, a method to restrict the splitpoints for an attribute was developed. The basic idea comes from supervised clustering. Thus, we try to identify regions in the data of the current attribute that share a small error computed on the target variable. The aim of the clustering is to yield partitions of the attribute that share a low error, in the hope that the error of a rule that covers these regions will also be low. Clustering stems from the same motivation because it also guarantees that each cluster has the lowest possible error. The user has to define how many clusters and hence how many splitpoints are desired. We experimented with different settings, but surprisingly a rather low number of splitpoints seemed to be sufficient (cf. Section 5.1).

Figure 1: Example of the splitpoint clustering method.

Figure 1 displays how the cluster algorithm works. In the example in Figure 1 the attribute has 10 values moving equidistantly from 1 to 10. The values depicted in blue are those of the target attribute of the respective example.
In the first step, the attribute values are ordered ascendingly and each value becomes a cluster containing exactly this value. Then two adjacent clusters are searched for which the error, when using the mean of the two target values as prediction, is the lowest. In the example these are the clusters 2 and 3, 7 and 8, and 8 and 9. Though the objective in the first step is to join two adjacent clusters, both 2, 3 and 7, 8 are joined (it is arbitrary whether to join 7 and 8 or 8 and 9). The mean of the first cluster is 3.5 = (4 + 3)/2 and the second one has a mean of 1.5 (depicted in black in Figure 1 above the number ray). If the mean absolute error is taken, both clusters have an error of 0.5, which is shown in brackets and in red in the corresponding figure. An error of 0.5 = (|4 - 3.5| + |3 - 3.5|)/2 is also the lowest error that can be achieved given the example data.

In the second step the function is executed recursively and again those clusters are joined that have the lowest error among all possible clusters. So, in this step, cluster 1 is joined with the second cluster and the cluster with a value of 9 is joined with the third cluster. The error of both clusters grows to 0.67 because adding the respective example increases the error (i.e., L_MAE = (|2 - 3| + |4 - 3| + |3 - 3|)/3 = 0.67 for the first cluster). Joining any of the untouched clusters leads to a higher error, which means that the cluster with the next lowest error is built in step 3. After the second step, two clusters containing at least 2 examples were built and therefore 5 splitpoints exist. In the example, the user-given number of splitpoints is set to 4. Hence another cluster has to be built until the algorithm is finished. This last cluster is derived by joining the clusters with the values 4 and 5, and it yields an error of 0.75. After the third step, 3 clusters are built and the splitpoints are simply derived by taking the mean between the values of two adjacent clusters, or between two values if the cluster contains only one example. The 4 splitpoints are 3.5, 5.5, 6.5, and 9.5 (depicted in red in Figure 1 in the number ray).

We have evaluated the effectiveness of the splitpoint method by comparing it to another splitpoint method where n splitpoints are selected equidistantly. The results of this comparison are shown in Section 6.1. For the computation of the error, the mean absolute error was used. This choice is arbitrary, but experiments with the root mean squared error did not yield any performance difference.

3.4 Parameters of the algorithm

There are 3 parameters the user has to specify:
• the parameter of the heuristic,
• the number of splitpoints (splitpoint-parameter), and
• the percentage of examples that are left uncovered (left-out-parameter).

The parameter of the heuristic is optimized with a greedy procedure that narrows down the region of interest. This procedure is described in detail in Section 5. The number of splitpoints is crucial for the runtime of the algorithm. This value was optimized by testing different values (cf. Section 5.1). The last user-given parameter is the percentage of examples that are left uncovered by the outer loop of the algorithm. This parameter clearly depends on the dataset. During the experiments there was some evidence that we had included databases that basically encode randomness, and for those, learning anything results in worse performance (e.g. the dataset quake).
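Two pieces of Section 3 lend themselves to a compact sketch: the combined heuristic of Eq. (6) and the greedy splitpoint clustering of Section 3.3. The version below merges one adjacent pair of clusters at a time and breaks ties arbitrarily, which is a simplification of the procedure described above; the rrmse value fed into h_cm can be computed as sketched earlier.

```python
def h_cm(alpha, rule_rrmse, rel_cov):
    """Combined heuristic of Eq. (6): trade off error (via 1 - rrmse)
    against relative coverage with parameter alpha."""
    return alpha * (1.0 - rule_rrmse) + (1.0 - alpha) * rel_cov

def clustered_splitpoints(values, targets, n_splitpoints):
    """Greedy bottom-up clustering of Section 3.3: start with one cluster per
    (sorted) attribute value and repeatedly merge the adjacent pair whose merged
    cluster has the lowest mean absolute error around its mean, until only
    n_splitpoints cluster boundaries remain."""
    pairs = sorted(zip(values, targets))
    clusters = [[pair] for pair in pairs]       # each cluster: list of (value, target)

    def merge_error(left, right):
        merged = [t for _, t in left + right]
        center = sum(merged) / len(merged)
        return sum(abs(t - center) for t in merged) / len(merged)

    while len(clusters) - 1 > n_splitpoints:
        # find the adjacent pair that is cheapest to merge
        i = min(range(len(clusters) - 1),
                key=lambda k: merge_error(clusters[k], clusters[k + 1]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]

    # a splitpoint is the mean between the last value of one cluster
    # and the first value of the next one
    return [(clusters[k][-1][0] + clusters[k + 1][0][0]) / 2.0
            for k in range(len(clusters) - 1)]
```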
4 Experimental setup To optimize the 3 parameters some datasets were used for tuning and were split into 2 folds of equal size. On the first fold of each dataset, all steps of the optimization procedure were done and afterwards the best model was evaluated on the second fold. This is also done vice versa. Hence, the experiments yield two configurations of the same al- gorithm that only differ in the parametrizations. A test of the parametrizations on the hold-out folds of the tun- ing datasets is the first step of the evaluation. Additionally some insights are gained by evaluating the two variants also on those datasets that were used during the optimization. To complete the evaluation, the two resulting configura- tions were also evaluated on some datasets that were not used for any optimization purposes. The aim of the experiments was to optimize the parame- ters of the algorithm on a set of diverse datasets to capture characteristics of a wide variety of different datasets. Our hope was that by taking a set of datasets that are very differ- ent the parameters would be more stable. For this reason, we selected 29 databases in total from the UCI-Repository [Asuncion and Newman, 2007] and from Luis Torgos web- site5. The datasets were divided into 20 sets that were used during the tuning phase and 9 sets that were only used for evaluation purposes. The tuning datasets were abalone, auto-mpg, auto-price, breast-tumor, compressive, concrete-slump, cpu, delta- ailerons, echo-month, forest-fires, housing, machine, pbc, pyrim, quake, sensory, servo, strike, triazines, winequality-white As mentioned above the main motivation to select these datasets was to capture a lot of different learning prob- lems. Thus, the number of nominal and numerical at- tributes should be different among the databases and the domains from which they origin should be as diverse as possible. The 9 datasets that were used to evaluate the algorithms were auto93, auto-horse, cloud, delta-elevators, meta, r wpby, stock, veteran, winequality-red The distribution among the 20 tuning databases in terms of nominal and numerical attributes as well as in terms of size should be approximately the same as in the test- ing datasets. Therefore both bags of data contain some small, some medium and some big databases. For a de- tailed overview of the datasets see [Janssen and Fu¨rnkranz, 2010b]. 4.1 Evaluation methods The primary method to evaluate the algorithm was the rrmse. The advantage of this evaluation measure clearly lies in its domain-independency. For some of the experi- ments it would take too much space to include results on every single dataset. In those cases our means for evaluat- ing the different algorithms was to average the results over all datasets. We are aware of the problems that come with averaging results over many different domains (i.e., some databases may be outliers with huge variance compared to the majority of the other datasets) and hence include a Friedman-Test with a post-hoc Nemenyi-Test as suggested in [Demsar, 2006]. The resulting CD-charts give insights how good the algorithms perform by evaluating their rank- ing independently from using average accuracy. There are other ways to evaluate regression algorithms domain-independently. The correlation coefficient for in- stance is also widely used. But there are some draw- backs from using this method regarding rule learning algo- rithms. 
For results including the correlation coefficient and a discussion of the drawbacks, see [Janssen and Fürnkranz, 2010b].

5 Optimizing parameters

For the split into the 2 folds, all datasets were randomized in advance using the unsupervised randomize function of weka [Witten and Frank, 2005]. All evaluation measures were computed using one run of a 10-fold cross validation.

[5] These databases can be downloaded at http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.

Table 1: Results for the splitpoint computation and left-out-parameter (average rrmse over the 5 parametrizations of the heuristic)

  parameter (splitpoint)   folds 1   folds 2
  1                        1.0675    1.0540
  3                        0.9929    1.0256
  5                        1.0132    1.0261
  7                        1.0067    1.0245
  9                        0.9992    1.0209
  11                       1.0126    1.0427
  19                       1.0163    1.0240

  parameter (left-out)     folds 1   folds 2
  0                        0.9929    1.0209
  0.01                     0.9787    1.0221
  0.02                     0.9776    1.0182
  0.03                     0.9759    1.0156
  0.05                     0.9739    0.9940
  0.1                      0.9704    0.9835
  0.2                      0.9736    0.9701

5.1 Optimization of the splitpoint and the left-out parameter

Though these two parameters are likely to show only a small deviation in performance among different databases, we decided to optimize them first and fix them before we start optimizing the parameter of the heuristic. We believe that the parameter of the heuristic has a stronger influence on the performance of the algorithm than the other two parameters. This is mostly because the heuristic is used to evaluate every single candidate rule and therefore is the most important factor in the algorithm. Note that the heuristic also has a strong influence on the quality of the rules and on the total number of rules found by the algorithm. The other two parameters also influence the performance of the algorithm, but rather in an indirect way, by assuring splitpoints of good quality and by leaving those examples untouched that are hard to learn.

For this reason, we focussed more on the heuristic parameter than on the other two. If we had concentrated more on these two, the performance of the algorithm could have become a bit better, but our main idea was to derive stable parameters for the heuristic, and we believe that the gain in performance depends more strongly on the heuristic parameter than on the other two (cf. the experimental results in Section 5.2).

To start optimizing the splitpoint parameter, the other two had to be fixed. In advance it is not known how to determine these values. Thus, the left-out-parameter was fixed to 0, so that all examples have to be covered. In this case the default rule is built by using the mean of all examples. In other cases, where examples remain uncovered, it is built by using the mean of all uncovered examples. On the contrary, it is not obvious what parameter value can be used for the heuristic. For this reason, 5 different values were used during the optimization. To make a choice, the two extremes were included (α = 0 and α = 1), and some values in between, namely 0.4, 0.5, and 0.6. These values were used to include different preferences of the heuristic. Clearly, using only two parameters would be suboptimal, because there is some evidence that the optimal parameter would rather lie somewhere in the middle of the domain than at its beginning or end [Janssen and Fürnkranz, 2010a]. We expected the curve yielded by plotting the parameter over the error to be shaped like a U, where the two extreme values would result in a rather bad performance and the optimal value lies somewhere around 0.5 (cf. Section 5.2).
To have a combined error mea- surement for the optimization procedure the mean over the rrmse of these choices was taken. In the beginning, the left-out-parameter was fixed to a value of 0 yielding a starting point for the optimization of the splitpoint-parameter. To find the best value some intu- itive values (1, 3, 5, 7, 9, 11, and 19) were used. All values bigger than 19 were skipped because a clear gain in run- time performance should be achieved. Using huge values would result in practically using all possible splitpoints and thus would not improve the algorithm’s runtime6. Table 1 (left table) shows the results for the two op- timization procedures (for the two folds of the tuning datasets). As can be seen the best number of splitpoints was 3 on the first folds and 9 on the second folds (the low- est error is depicted in bold in the figure). On the first folds, however, using 9 splitpoints yields the second best rrmse which lacks only 0.0063 behind the best performing num- ber of splitpoints. On the second folds, using 9 splitpoints performed best followed by using 19 splitpoints. Using 3 splitpoints lacks 0.0047 in terms of rrmse behind the best one and therefore is the fourth best method. Nevertheless, the gap between different parametrizations seems to be big- ger on the first folds than on the second ones. Regarding the split of all tuning datasets into 2 folds, these results seem to reflect the randomness in splitting the datasets. After the optimization of the splitpoint parameter the same procedure was employed to the left-out-parameter of the algorithm. Here, the splitpoints were already fixed to 3 for the algorithm tuned on the first folds and to 9 for the variant tuned on the second folds. To find the best value also some intuitive parameters were used. Thus, the values 0, 0.01, 0.02, 0.03, 0.05, 0.1, and 0.2 were tested during this optimization. The setting where all examples are cov- ered by rules was included to make sure that is more effec- tive to leave some parts of the data uncovered. Clearly, an optimal setting is dataset-dependent. But it also depends on the quality of the induced rules. For numerical target vari- ables it can be useful to cover only those parts of the data that share some common characteristics. For the remainder of the data it could be beneficial to treat them independently from their characteristics. As can be seen in Table 1 (right table) two different pa- rameters performed best on the two folds. Practically this can be attributed to the same reasons that were already dis- cussed during the optimization of the splitpoint parameter. Thus, on the one hand the randomized split of the data into 2 folds of equal size could have manipulated the charac- teristics of the datasets. On the other hand it could also be possible that there is no unique best value for leaving examples uncovered. The results also show that leaving examples uncovered is mandatory for the performance of the algorithm. 5.2 Optimization of the heuristics parameter For the optimization of the heuristics parameter a frame- work similar to the one introduced in [Janssen and Fu¨rnkranz, 2010a] was used. It employs a binary search to find the best parameter and was proven to yield stable pa- rameters for classification heuristics as shown in [Janssen and Fu¨rnkranz, 2010a]. The search is started with a range of intuitively appealing parameters. Thus, the two extremes of 0 and 1 are tested together with some values in between (0.1, 0.2, ..., 0.9). 
All settings are evaluated by taking the average of the rrmse on the 20 datasets presented in Section 4. Then the best performing parameter is used for further inspection. Therefore, an area around this parameter is inspected in more detail. There are several choices to do this, but we decided to evaluate 6 parameters around the best one. Those are distributed equidistantly around the best parameter, with the step size decreasing from 0.1 to 0.01. This procedure is executed recursively, so in the next step the 6 parameters around the next best value are tested. The search stops if the rrmse improvement falls below a threshold of t = 0.0005. This choice was arbitrary, but we believe that the effort needed to narrow down the parameter for the next step of the search procedure is too high compared to the performance gain the next execution may yield.

[6] Note that the number of distinct values for an attribute in the data is rather small.

Figure 2: Parameters over rrmse for both folds of the tuning datasets ((a) tuned on folds 1, (b) tuned on folds 2; x-axis: parameter α, y-axis: RRMSE).

Figure 2 shows a graphical interpretation of the search for both experiments. For both of them, very low parameter settings result in bad performance. When the parameters are increased, the performance becomes better until the optimal setting is reached. After that it decreases again. For the parameters that are optimized on the first folds of the datasets (Figure 2 (a)), the curve shows some fluctuations in the part located left of the best parameter. In contrast, the curve depicted in Figure 2 (b) is monotonically decreasing in this area. For parameter settings that are bigger than the best parameter, the curve in the left figure shows a monotone increase, whereas it shows more fluctuations when the parameter is tuned on the second folds of the partitioned datasets.

Interestingly, the best parameters are very similar in both experiments. This means that the parameters are stable among different splits of the datasets. On the first folds of these datasets the best parameter lies at 0.59 and on the second folds it was 0.591. For the first folds, the parameter 0.591 lags only 0.007 behind in terms of rrmse. For the second folds, the difference in performance was 0.001. Assuming that the best parameter lies somewhere in the region of 0.6, consistency should be preferred over coverage for regression rules. This also holds for classification rules, where the preference for consistency is even stronger than in regression. Nevertheless, it is an interesting result that the evaluation of a rule's quality follows similar standards in classification and in regression.

6 Results

6.1 Splitpoint processing

Table 2 shows a comparison of the runtime of 2 different splitpoint methods. At first, 3 equidistant splitpoints per attribute were used. Then, 3 clustered splitpoints were employed. Evaluating all splitpoints was too costly, due to some huge datasets.

Table 2: Runtime of different splitpoint methods on the test set

  method                      runtime (in sec.)
  3 equidistant splitpoints   2625.4
  3 clustered splitpoints     1234.3

All runtimes depicted in Table 2 are the averages of 10 independent runs on a dual Pentium 4 2.8 GHz processor with 2 GB RAM on the 9 datasets used for testing (cf. Section 4).
As can be seen in Table 2, the clustered splitpoint computation is more efficient than the equidistant method. At first sight this may appear contrary to what could be expected: due to the much simpler computation of equidistant splitpoints, this method should be faster than the clustering method. But note that this evaluation was done by letting the whole algorithm run on the 9 test datasets. Not surprisingly, the quality of the equidistant splitpoints is worse compared to the clustered splitpoints. This results in a significantly higher number of candidate rules that have to be evaluated during the search for the best rule, a number which can be drastically reduced by using clustered splitpoints.

6.2 Comparison with other systems on the tuning datasets

The main focus of the comparison is how well the algorithm performs against other regression algorithms. Table 3 gives an overview of the different algorithms compared to each other on the two folds using the rrmse. From now on, the tuned algorithm is referred to by the name SeCoReg. There are 4 other algorithms, all implemented in weka [Witten and Frank, 2005], which were used to compare our system with. Clearly, some of them are much more complex than our rather simple algorithm (note that our algorithm has neither a pruning functionality nor an optimization phase). On the other hand, most of them employ more complex models, i.e., hyperplanes like the multilayer perceptron (MLP) or support vectors like the SVMReg. The linear regression (Linear Reg.) is also a rather simple algorithm that nevertheless offers quite a good trade-off between runtime and error. M5Rules uses rules to explain the data. These rules predict linear models, which makes the algorithm much more flexible, because each rule is able to map the examples to many different outcome values.

The parameters of all weka algorithms were left at default values. The reasons to select these 4 algorithms were that our implementation had to prove that it is comparable in terms of error to other state-of-the-art systems. Another reason to select these particular algorithms for the benchmark was the lack of freely available regression rule learning algorithms. The only free system we found was REGENDER, and a comparison is given in Section 6.3.

Table 3: Results in terms of average rrmse for different algorithms on the tuning datasets

  algorithm           folds 1   folds 2
  M5Rules             0.7425    0.8058
  Linear Reg.         0.8145    0.9116
  MLP                 1.0154    1.3890
  SVMreg              0.7917    0.8500
  SeCoReg (folds 1)   0.8736    0.9291
  SeCoReg (folds 2)   0.8976    0.8903

Figure 3: Comparison of all algorithms against each other with the Nemenyi test. Groups of algorithms that are not significantly different (at p = 0.05) are connected ((a) folds 1, (b) folds 2).

In Table 3 the results of all algorithms on the two folds of the tuning datasets are displayed. Results of both derived SeCoReg-algorithms are shown together with their performance on the data sets on which they were tuned (in italics). Not surprisingly, both variants of the algorithm perform better on the folds they were tuned on than on the left-out folds. The ranking of the algorithms is similar in both experimental variants. The best one was M5Rules, followed by the SVMreg and the linear regression.
The SeCoReg was ranked in 4th place in both experiments (by average rrmse), only slightly behind the linear regression (lagging 0.0831 behind on the first folds and 0.0175 on the second folds). The multilayer perceptron had the worst performance, with a rather big gap to the next better algorithm. Figure 3 shows CD-charts for both experiments. Note that the figure displays ranks averaged over all datasets. Only the algorithm M5Rules was significantly better than the SeCoReg in both cases.

6.3 Comparison with other algorithms on the test sets

To validate the results on completely different datasets, the algorithm was also tested on the 9 independent test sets (cf. Section 4). This step is necessary to make sure that the tuning datasets, even though they were split into two disjunct folds, were not overfitted during the parameter tuning phase. Table 4 displays the results in terms of rrmse on the test databases for all of the 4 weka algorithms and the two configurations of the SeCoReg-learner. The ranking of the algorithms differs slightly compared to the results on the 20 datasets. Hence, on the test sets the SVMreg performs best, followed by the M5Rules-system. In third place the first SeCoReg-learner appears. It was only slightly worse in performance compared to the M5Rules-learner. The next best algorithm is the second SeCoReg-learner, which achieved a marginally better rrmse than the linear regression. As in the previous experiments, the multilayer perceptron was the worst algorithm.

Thus, on the test sets the tuned SeCoReg-algorithm achieved better results than in the previous experiment. Here, the best configuration of the algorithm is ranked in third place. Note that the dataset meta shows huge standard deviations for some algorithms (M5Rules, linear regression and MLP). We attribute this to the separation of the data into the 10 folds of the cross validation.

In addition to the error measurements, a Friedman-Test was employed as in the previous section (cf. Section 6.2). Contrary to the prior results, the Friedman-Test was not rejected at a p-value of 0.05 (the critical F-value was 2.196, but to reject the test it had to be bigger than 2.492). It would have been rejected at a p-level of 0.1, but this was not significant enough to include these results in the paper. For this reason, the Nemenyi-Test could also not be done on the test sets. Practically, this means that the SeCoReg algorithm does not differ significantly from the 4 weka algorithms at a significance level of 0.05.

Table 5: Results in terms of rrmse compared to RegENDER on 7 datasets of the test set

  algorithm                          avg. rrmse   avg. rank
  SeCoReg (folds 1)                  0.8154       3.00
  SeCoReg (folds 2)                  0.8538       3.13
  RegENDER (10 rules)                0.9008       3.88
  RegENDER (100 rules)               0.9291       3.88
  RegENDER (# rules from folds 1)    0.9221       3.75
  RegENDER (# rules from folds 2)    0.9034       3.38

Table 5 shows a comparison to RegENDER [Dembczyński et al., 2008]. The dataset auto-horse contains missing class values, which cannot be handled by RegENDER. Therefore, this dataset was left out. In addition, the results on the dataset meta showed strong fluctuations, as mentioned before. For this reason, this dataset was also left out. RegENDER has a parameter to specify the number of rules in the ensemble. To make a choice, the algorithm was tested with 10 and 100 rules and with the same number of rules the two SeCoReg variants had found on each test set. Clearly, using more rules will result in a lower error (cf. [Dembczyński et al., 2008]), but we think it is fair to run the algorithm with the same number of rules as used in the SeCoReg-learner.
The SeCoReg-algorithm was slightly better in average rrmse, and the average rank was also better. Nevertheless, a Friedman Test was rejected (p = 0.05), but the Nemenyi Test showed that all algorithms were in the same equivalence class (the critical distance extends over all algorithms) and therefore do not differ statistically significantly.

To sum up, both tuned variants of the presented algorithm are not able to beat state-of-the-art systems. They are rather situated in the middle of the performance of the other algorithms (this holds at least for the test sets). Especially on the 9 test sets it became clear that the SeCoReg rule learners are able to achieve a performance comparable to the results of the 4 weka algorithms and RegENDER.

Table 4: Results in terms of rrmse for different weka algorithms and the SeCoReg-learners on the test set

  dataset           SVMreg        M5Rules       Linear Reg.   MLP           SeCoReg (tuned on folds 1)   SeCoReg (tuned on folds 2)
  auto-horse        0.32 ± 0.08   0.37 ± 0.14   0.32 ± 0.11   0.34 ± 0.10   0.52 ± 0.18                  0.61 ± 0.11
  auto93            0.66 ± 0.12   0.58 ± 0.19   0.67 ± 0.20   0.57 ± 0.19   0.65 ± 0.17                  0.85 ± 0.29
  cloud             0.39 ± 0.12   0.42 ± 0.16   0.40 ± 0.13   0.62 ± 0.33   0.61 ± 0.19                  0.67 ± 0.15
  delta-elevators   0.61 ± 0.01   0.60 ± 0.01   0.61 ± 0.01   0.63 ± 0.01   0.78 ± 0.03                  0.77 ± 0.03
  meta              0.92 ± 0.08   1.86 ± 1.58   2.33 ± 1.72   1.40 ± 0.90   1.00 ± 0.02                  1.01 ± 0.03
  r wpbc            1.03 ± 0.16   1.14 ± 0.19   1.04 ± 0.13   2.20 ± 0.56   1.35 ± 0.20                  1.27 ± 0.18
  stock             0.37 ± 0.05   0.14 ± 0.03   0.36 ± 0.04   0.20 ± 0.04   0.25 ± 0.03                  0.26 ± 0.04
  veteran           0.93 ± 0.15   1.23 ± 0.61   1.07 ± 0.36   3.01 ± 1.78   1.09 ± 0.22                  1.21 ± 0.33
  winequality-red   0.82 ± 0.03   0.81 ± 0.03   0.81 ± 0.03   0.95 ± 0.08   0.98 ± 0.09                  0.95 ± 0.04
  averages          0.6739        0.7942        0.8456        1.1017        0.8040                       0.8438

Due to the rather simple design of the SeCoReg-algorithm, these results seem to be promising.

7 Conclusion and further work

In this paper a new rule learning algorithm for the task of regression was presented. It was shown that the algorithm performs comparably to different state-of-the-art algorithms implemented in weka and to RegENDER, a rather new algorithm.

A new splitpoint generation method was introduced. This method proved to support the quality of candidate rules and even results in a lower runtime compared to naive methods like the generation of equidistant splitpoints. Nevertheless, the number of generated candidate rules directly depends on the number of splitpoints. But as shown in the experiments, at least for one configuration of the algorithm a number of 3 splitpoints per numerical attribute was enough.

A novel rule learning heuristic was introduced that clearly improves the algorithm's performance due to its flexibility in weighting the error of a rule against its coverage. An optimal setting for this regression rule heuristic was presented, and it proved to be stable, since the parameter values found on both folds are nearly the same. An interesting observation is that, as known from classification, in regression a rule's consistency should also be preferred over its coverage.

A promising path to optimize the algorithm would be to adopt the advantages of algorithms like M5Rules, which predict linear models in the head of each rule. On the one hand, the performance of the algorithm should be drastically improved when using linear models instead of single target values. On the other hand, much of the interpretability of the rule set would be lost when doing so.
Acknowledgments This research was supported by the German Science Foundation (DFG) under grant no. FU 580/2-2. References [Asuncion and Newman, 2007] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. [Clark and Niblett, 1989] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989. [Cohen, 1995] William W. Cohen. Fast effective rule induction. In ICML, pages 115–123, 1995. [Dembczyn´ski et al., 2008] Krzysztof Dembczyn´ski, Wojciech Kotłowski, and Roman Słowin´ski. Solving regression by learn- ing an ensemble of decision rules. In ICAISC ’08, pages 533– 544, Berlin, Heidelberg, 2008. Springer-Verlag. [Demsar, 2006] Janez Demsar. Statistical comparisons of classi- fiers over multiple datasets. Machine Learning Research, 7:1– 30, 2006. [Demsˇar, 1999] D. Demsˇar. Obravnavanje numericnih proble- mov z induktivnim logicnim programiranjem. Master’s thesis, Faculty of Computer and Information Science, University of Ljubljana, Slovenia, 1999. In Slovene. [Friedman and Popescu, 2008] Jerome H. Friedman and Bog- dan E. Popescu. Predictive learning via rule ensembles. Annals Of Applied Statistics, 2:916, 2008. [Fu¨rnkranz and Flach, 2005] Johannes Fu¨rnkranz and Peter Flach. ROC ’n’ rule learning – Towards a better understanding of covering algorithms. Machine Learning, 58(1):39–77, 2005. [Fu¨rnkranz, 1999] Johannes Fu¨rnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54, February 1999. [Holmes et al., 1999] Geoffrey Holmes, Mark Hall, and Eibe Frank. Generating rule sets from model trees. In Twelfth Aus- tralian Joint Conference on Artificial Intelligence, pages 1–12. Springer, 1999. [Janssen and Fu¨rnkranz, 2010a] Frederik Janssen and Johannes Fu¨rnkranz. On the quest for optimal rule learning heuris- tics. Machine Learning, 78(3):343–379, March 2010. DOI 10.1007/s10994-009-5162-2. [Janssen and Fu¨rnkranz, 2010b] Frederik Janssen and Johannes Fu¨rnkranz. Separate-and-conquer regression. Technical Re- port TUD-KE-2010-01, TU Darmstadt, Knowledge Engineer- ing Group, 2010. [Karalicˇ and Bratko, 1997] Aram Karalicˇ and Ivan Bratko. First order regression. Machine Learning, 26(2-3):147–176, 1997. [Quinlan, 1992] Ross J. Quinlan. Learning with continuous classes. In 5th Australian Joint Conference on Artificial In- telligence, pages 343–348, Singapore, 1992. World Scientific. [Torgo and Gama, 1996] Lus Torgo and Joo Gama. Regression by classification. In In Proceedings of SBIA96, Borges, pages 51–60. Springer-Verlag, 1996. [Torgo, 1995] Luis Torgo. Data fitting with rule-based regres- sion. In In Proceedings of the 2nd international workshop on Artificial Intelligence Techniques (AIT’95, 1995. [Zˇenko et al., 2005] Bernard Zˇenko, Saso Dzˇeroski, and Jan Struyf. Learning predictive clustering rules. In In 4th Intl Workshop on Knowledge Discovery in Inductive Databases: Revised Selected and Invited Papers, volume 3933 of LNCS, pages 234–250. Springer, 2005. [Wang and Witten, 1997] Y. Wang and I. H. Witten. Induction of model trees for predicting continuous classes. In Poster pa- pers of the 9th European Conference on Machine Learning. Springer, 1997. [Weiss and Indurkhya, 1995] Sholom M. Weiss and Nitin In- durkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3:383– 403, 1995. [Witten and Frank, 2005] Ian H. Witten and Eibe Frank. Data Mining — Practical Machine Learning Tools and Techniques with Java Implementations. 
Morgan Kaufmann Publishers, 2nd edition, 2005.

Kernelized Map Matching for Noisy Trajectories

Ahmed Jawad, Kristian Kersting
Knowledge Discovery Dept., Fraunhofer IAIS, 53754 Sankt Augustin, Germany
ahmed.jawad@iais.fraunhofer.de, kristian.kersting@iais.fraunhofer.de

Abstract
Map matching is a fundamental operation in many applications such as traffic analysis and location-aware services, the killer apps for ubiquitous computing. In the past, several map matching approaches have been proposed. Roughly speaking, they can be categorized into four groups: geometric, topological, probabilistic, and other advanced techniques. Surprisingly, kernel methods have not received attention yet, although they are very popular in the machine learning community due to their solid mathematical foundation, tendency toward easy geometric interpretation, and strong empirical performance in a wide variety of domains. In this paper, we show how to employ kernels for map matching. Specifically, ignoring map constraints, we first maximize the consistency between the similarity measures captured by the kernel matrices of the trajectory and the relevant part of the street map. The resulting relaxed assignment is then "rounded" into a hard assignment fulfilling the map constraints. On synthetic and real-world trajectories, we show that kernel methods can be used for map matching and perform well compared to probabilistic methods such as HMMs.

1 Introduction

Map matching is a fundamental operation for many real-world applications that are currently changing how we as a society use computers in daily life. Following [Mitchell, 2009]: after decades of analysing "historical" data, 'location-based services' are a major driving force for analysing "real-time" and "real-space" data that record personal activities, conversations, and movements in an attempt to improve human health, guide traffic, and advance the scientific understanding of human behaviour. Therefore, it is not surprising that map matching, the problem of mapping a temporal sequence of coordinates onto a given network, has received a lot of attention. Roughly speaking, existing approaches for map matching can be categorized into four groups: geometric, topological, probabilistic, and other advanced techniques. Surprisingly, however, kernel methods such as support vector machines and Gaussian processes have not received attention yet, although they are very popular in the machine learning community due to their solid mathematical foundation, tendency toward easy geometric interpretation, and strong empirical performance in a wide variety of domains.

Figure 1: Illustration of the map matching problem: given a graph G = (V,E), denoted as gray edges, and a trajectory T = {p1, p2, ..., pn}, denoted as red squares, find the ground truth path (dashed line) starting from T.

In a nutshell, kernel methods first process a dataset into a kernel matrix that roughly expresses the idea that two data points are "equivalent as far as some function f of the data can tell". By representing the data in terms of a kernel matrix, the data can be of various types, including heterogeneous types such as trees and graphs. This makes kernel approaches very flexible. In a second step, a variety of kernel algorithms that have been developed can be used to analyse the data, using only the information contained in the kernel matrix.
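As a toy illustration of this first step, the snippet below builds a Gaussian (RBF) kernel matrix over a handful of 2-D points. The kernel choice and the bandwidth are generic placeholders, not the spatio-temporal kernel developed in this paper.

```python
import numpy as np

def rbf_kernel_matrix(points, bandwidth=1.0):
    """Kernel matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)),
    i.e. a similarity of 1 for identical points, decaying with distance."""
    pts = np.asarray(points, dtype=float)
    sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# A tiny made-up trajectory of (x, y) coordinates:
trajectory = [(0.0, 0.0), (1.0, 0.2), (2.1, 0.1), (3.0, 0.4)]
K = rbf_kernel_matrix(trajectory, bandwidth=1.5)
print(np.round(K, 3))
```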
Kernels are also readily extendible; every kernel matrix therefore provides an opportunity to integrate its knowledge with existing kernels in the field. An investigation of using kernel methods for map matching, motivated by their well-known strengths, was the seed that grew into our proposal for kernelized map matching. Through this article, we make the following contributions to map matching research:

1. Establishing a link between 'map matching' and 'kernel methods'.
2. Specifically building and investigating an instantiation of this link: a simple yet effective kernelized map matching approach, called 'Kernelized Map Matching' (KMM).
3. To do so, we also develop a simple kernel to capture spatio-temporal correlations under one umbrella.

Figure 2: Illustrations of KMM's performance on challenging real-world situations: (a) KMM over a roundabout, (b) KMM over a grid structure, (c) robustness to noise. The red points show the input trajectory and the blue points the output path by KMM. In all three cases, KMM recovers the exact ground-truth paths.

Specifically, triggered by the observation that matching a trajectory of coordinates would be easy if the observed coordinates were noise-free — the coordinates would simply constitute the solution — one may propose to treat the map matching problem as a regression problem. That is, we treat a trajectory as a function t for which we observe a (coloured) sequence of noisy values, the coordinates t(i) + ε, at inputs i = 1, 2, 3, ..., k, the temporal order of coordinates. The task is now: estimate the noise-free function t from the noisy observations. In contrast to most traditional regression tasks, however, the outputs are structured due to the physical constraints in the world, and in turn there are non-linear dependencies among coordinates. Although kernel methods are powerful tools for modelling non-linear dependencies, and hence seem to be relevant for map matching, most kernel (regression) models focus on the prediction of a single output. Although generalizations to multiple outputs can often be achieved by training independent models for each one or tying parameters across dimensions, this fails to account for output correlations [Weston and Vapnik, 2002]. Consequently, we propose a different approach, namely to "embed" the output, i.e., the coordinates of the trajectory, into the same space as the trajectory and the network, and in turn to reduce the noise while still capturing the multi-output, non-linear dependencies present in trajectories. Specifically, ignoring map constraints, we first embed the trajectory and hence reduce noise. The resulting relaxed assignment is then "rounded" into a hard assignment fulfilling the map constraints. On synthetic and real-world trajectories, we show that this approach, called kernelized map matching, can be used for map matching and performs well compared to state-of-the-art hidden Markov models. Fig. 2 shows KMM's performance on some of the tasks considered difficult among map matching experts [Newson and Krumm, 2009].

We proceed as follows. We start off by reviewing related work and mathematical background. In Section 3, we then introduce kernelized map matching. Before concluding, we present the results of our experimental evaluation in Section 4.

2 Related Work
The work presented in this paper is conceptually built upon three areas of research, namely map matching, structural embeddings, and kernel methods.
We briefly describe the background work in each of these areas respectively. Quddus et al. [Quddus et al., 2007] provide a good sur- vey of existing map matching algorithms. Following them, existing approaches can be broadly categorized into four categories (or a combinations of them): geometric [Fox et al., 2003], topological , probabilistic [Quddus et al., 2007] and other advanced techniques [Taylor et al., 2001]. Specifically, geometric map matching approaches utilize the shape of the spatial road network without consider- ing the continuity or connectivity of it. Topological ap- proaches use the connectivity and other topological fea- tures to restrict the candidate matches for solution points. Finally, probabilistic and advanced approaches are referred to as those ”using more refined concepts such as a Kalman Filter, a fuzzy logic model or the application of hidden Markov model”. Another dimension along which map matching approaches can be distinguished is ”incremental vs. global” [Lou et al., 2009]. In the present paper, we take the traditional, data-driven view of map matching (refer to [Lou et al., 2009] for more details) which is defined as Given a trajectory T and a street network G, find a path P in G that matches T with its real or ground truth path. Next to traditional map matching approaches, our approach builds upon ideas from structural embeddings [Quadrianto et al., 2009], [Guo et al., 2008]. The idea is to view trajectories as a special type of graph called Euclidean graph [Pemmaraju and Skiena, 2003]. Definition 1 (Euclidean graph): A Euclidean graph G(V,E) is a graph in which 1. the vertices represent points in the Euclidean plane R2, and 2. the edges are assigned lengths equal to the Euclidean distance between those points, i.e., a straight line be- tween corresponding vertices. In order to view a trajectory as an Euclidean graph, we need to capture the spatial and temporal aspects of the trajectory inside it. The spatial aspect of a trajectory is already cap- tured in the Euclidean graph through R2 coordinates of the graph’s vertices. In order to capture the temporal aspects of a trajectory, we time stamp the vertices of the graph in the order of the trajectory. The poly-line structure of street network graph with its vertices embedded in R2 is, indeed, an Euclidean graph. To make it explicit, we can add ad- ditional vertices to every corner of the lines constituting a road segment. Consequently, map matching, can be viewed as a problem of ”matching, i.e., finding similarity between two Euclidean graphs”. Indeed, Euclidean graphs can actually be found in many machine learning problems such as map matching [Brakat- soulas et al., 2005], shape analysis, protein structure analy- sis, time series similarity analysis, among others. However, —to the best of our knowledge— ,the problem of match- ing Euclidean graphs, i.e., ”finding similarity between two Euclidean graphs” has not been considered yet. Instead, 90 the problem has been approximated by casting it into a tra- ditional ”embedding” respectively ”graph matching” prob- lem and in turn dropping important information. Specif- ically, graph matching is typically formulated as a prob- lem where we only seek a node-to-node matching between the input and output graphs. 
Euclidean graph matching is different, as the nodes of the input graph (trajectory) can be mapped to any point lying on the edges of the output graph (street network graph). Embedding [Lee and Verleysen, 2007], on the other hand, is a generic problem: find a set of points in a (typically) low-dimensional space having similar properties and relationships as the points in the original input space. So far, however, it has only been considered for matching graphs in the traditional sense discussed above, see e.g. [Lee and Verleysen, 2007], [Guo et al., 2008; Quadrianto et al., 2009]. Hence, this approach suffers from the same issues as the standard graph matching approach. Nevertheless, these works employ kernel methods. We use a simple embedding technique called multidimensional scaling (MDS) [Cox and Cox, 2000], defined as follows:

Definition 2 (Multidimensional Scaling (MDS)) MDS is an embedding technique which uses the so-called stress function (the sum of squared differences between the similarity criteria of the input and latent spaces) to come up with an embedding X for input Y. Let Kx and Ky denote the similarity matrices for the output X and the input Y. Then the objective function Fo using MDS can be described as follows:

    F_o = \arg\min_X \sum_{i,j} \big(K_x(i,j) - K_y(i,j)\big)^2    (1)

Embedding approaches, i.e., MDS and others, are called 'kernelized embedding' [Guo et al., 2008] when they use kernel matrices as similarity criteria. Kernel methods help with embedding in special cases because they are more suitable for non-vectorial data sets (text, images, protein sequences, graphs, trajectories, etc.) [Shawe-Taylor and Cristianini, 2004]. Our method is inspired by these techniques but generalizes them to the Euclidean graph case.
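Since the stress function of Eq. (1) is the backbone of the objective we build later on, a minimal Python sketch may help to fix ideas. This is an illustration only, not part of any KMM implementation; the function names and the plain RBF similarity are our own assumptions.

```python
import numpy as np

def mds_stress(K_x, K_y):
    """Sum of squared differences between two similarity matrices, cf. Eq. (1)."""
    return np.sum((K_x - K_y) ** 2)

def rbf_similarity(points, sigma):
    """Plain RBF similarity matrix for a set of 2D points (illustrative choice)."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Toy usage: stress of a candidate embedding X with respect to the input Y
Y = np.random.randn(10, 2)
X = Y + 0.1 * np.random.randn(10, 2)
print(mds_stress(rbf_similarity(X, 1.0), rbf_similarity(Y, 1.0)))
```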
3 Kernelized Map Matching
Recall from the introduction that map matching would be easy if the observed coordinates were noise-free. In this case the observed coordinates would simply constitute the solution. In general, we cannot expect to reduce the noise completely. Consequently, we propose a two-step approach:
(1) To reduce the noise, embed the trajectory from R2 into R2 using kernel methods to capture the multi-output, non-linear dependencies present in trajectories.
(2) To account for remaining noise, "round" the embedding into a hard matching.
We will now explain each of the steps in turn.

3.1 Euclidean Graph Matching
We proceed with some notation and a problem description. G(V,E) denotes a trajectory (input graph) whose vertices are indexed by time t, i.e., V = {v_t}, t = 1, ..., |V|, with v_t in R2, while G'(V',E') denotes the street network graph (output graph), where V' = {v'_i}, i = 1, ..., |V'|, with v'_i in R2, is the set of nodes of the street segments and E' = {e'_{i,j}} with e'_{i,j} = {θ v'_i + (1-θ) v'_j, θ in [0,1]} if e'_{i,j} corresponds to a street segment, and empty otherwise. Notice that e'_{i,j} is defined as a convex combination of vertices, i.e., a straight line between the nodes. In general, a street segment is defined as a poly-line; however, we can always divide it into a set of straight lines, considering each corner of the lines as a vertex, to match our representation. Figure 3 illustrates the meaning of a street segment in E' as a convex combination. Φ denotes the vector of mappings between a trajectory G and the street network G'.

Figure 3: An edge as a convex combination of two vertices.

Unlike traditional graph matching, where nodes of one graph are only matched to nodes of another graph, we need a mapping Φ which matches nodes of our input graph (trajectory) to any location on the street segments in E', i.e., Φ_t = Φ : v_t -> e' × [0,1], where the interval [0,1] specifies a value of θ pointing to the exact location of the mapping on a street segment in E'. The truth value of vertex assignments to street segments can be stored in a binary matrix Δ, i.e., Δ in {0,1}^{|V| × |E'|} subject to Δ^T 1 = 1. Here 1 denotes a column vector of all ones. The last constraint enforces that each trajectory point can only be mapped to one street edge at a time.

Definition 3 (Map Matching) Find a correspondence Φ between vertices of the trajectory G and locations on the street network G' such that the two matched sets, V and Φ, look most similar according to an objective criterion Fo.

The problem is solved by finding an assignment of V to points lying on the edge set E' that minimizes the criterion Fo. The function Fo is of special interest and should have a form which preserves the relationships among input points while translating them to the output space. We further assume that Fo is a decomposable function, a summation of basis functions denoted by f_{Φi,Φj}. Thus, to minimize Fo, we need to minimize the individual entries of the summation. We also introduce the assignment matrix Δ in Fo; it enforces the Euclidean graph (or street network) constraints on the output, i.e.,

    F_o = \arg\min_\Phi \sum_{l=1}^{|E'|} \sum_{i,j}^{|V|} \Delta_{i,l} \cdot f_{\Phi_i,\Phi_j}
    \quad \text{s.t.}\ \Delta \in \{0,1\}^{|V| \times |E'|},\ \Delta^\top \mathbf{1} = \mathbf{1}    (2)

Eq. (2) describes the map matching problem as a so-called integer linear program (ILP); such programs are generally known to be NP-hard except for a few classes. Such mathematical programs have a discrete and a continuous part. In our case, the discrete part chooses the correct combination of trajectory point (v in V) versus street segment (e' in E'). The total number of combinations is |E'|^{|V|}, which easily gets intractable even for a modest number of trajectory points and street network segments. Relaxation is a standard technique to reduce the complexity of ILP problems. In relaxation, we drop the discrete part of the problem to make it a standard linear programming problem where the optimization can be carried out in polynomial time. The result of the optimization is then rounded into a hard assignment fulfilling the discrete constraints to get an approximate solution. In our case, the discrete constraints amount to street network constraints, and relaxation means ignoring these constraints. Eq. (3) describes the unconstrained Fo:

    F_o = \arg\min_\Psi \sum_{i,j}^{|V|} f_{\Psi_i,\Psi_j}    (3)

Notice that we have changed the output vector Φ to the intermediate output Ψ. The mapping Ψ is different from Φ as it maps a vertex v_t to R2, i.e., Ψ_t = Ψ : v_t -> R2. Now the solution of the optimization will be an approximation of the path used by the trajectory instead of the original path. Afterwards we can convert this approximate path into the street network path through a rounding step. Hence, we provide a two-step framework for the solution of Eq. (2):
• Optimize the relaxed objective function to approximate the trajectory path.
• Provide a rounding scheme that assigns the output of the first step to the street network.
We proceed by describing the details of these two steps in turn.

Optimization Step
Eq. (3) describes Fo as a summation of entries f_{Ψi,Ψj}. Now we come to the details of an individual entry f_{Ψi,Ψj}.
For this purpose, we define two kernel matrices K_G and K_G', where K_G(i,j) is the kernel function between trajectory vertices v_i and v_j, i.e., K_G(i,j) = k_G(v_i, v_j), and K_G'(i,j) is the kernel function between the mappings Ψ_i, Ψ_j of the vertices v_i and v_j, i.e., K_G'(i,j) = k_G'(Ψ_i, Ψ_j). The widths of the kernels K_G and K_G' are denoted by σ_G and σ_G'. Now we define f_{Ψi,Ψj} as the 'difference of kernels' function:

    f_{\Psi_i,\Psi_j} = \big(k_G(v_i, v_j) - k_{G'}(\Psi_i, \Psi_j)\big)^2    (4)

Now we can substitute the value of f_{Ψi,Ψj} in Eq. (3):

    F_o = \arg\min_\Psi \sum_{i,j}^{|V|} \big(k_G(v_i, v_j) - k_{G'}(\Psi_i, \Psi_j)\big)^2    (5)

Eq. (5) comes from the embedding technique known as multidimensional scaling described in Eq. (1). We further note that Eq. (5) is like a regression equation with multiple outputs, where we want to preserve the correlation among inputs during our structured prediction process; it can be used for embeddings across different spaces and structures, although our case is a special one where the input and output spaces are the same. To encode our prior knowledge that the solution of the embedding lies in the spatial neighbourhood perturbed by Gaussian noise, we add a regularization term which fuses the input and output space into one. We propose a kernel function k_GG' between the respective points of our input and output graphs and define it as

    k_{GG'}(\lambda, \sigma_N, i) = e^{-\left((v_i - \Psi_i)^2 - \lambda\sigma_N^2\right)^2 / (2\sigma_N^2)}    (6)

where λ is a stiffness parameter of the kernel function, while σ_N, the kernel width, is the estimated standard deviation of the noise in the trajectory. We finalize our objective function Fo as follows:

    F_o = \arg\min_\Psi \sum_{i,j} \big(K_G(i,j) - K_{G'}(i,j)\big)^2 - \sum_i k_{GG'}(\lambda, \sigma_N, i)    (7)

Eq. (7) is the final objective function which we use in kernelized map matching; however, alternative embedding approaches can also be used in principle. We can take the derivative of Fo in Eq. (7) w.r.t. Ψ and find the minimizer numerically. We use the 'scaled conjugate gradient' algorithm for optimization [Shewchuk, 1994]. For assignment purposes, we apply a rounding scheme on the result of the continuous optimization. We implement the optimization step in a sliding-window style of width kw, because it makes the solution real-time, localized, and efficient.

Spatio-temporal Kernel over Euclidean Distances
The kernels K_G and K_G' that we use are similar to RBF kernels. The only difference from an RBF kernel is that we use 'the sum of Euclidean distances over all successive pairs of points between i and j' instead of the 'Euclidean distance between i and j' directly. This little trick allows us to capture the spatial and temporal correlation among trajectory points in our kernel matrices. To elaborate further: for two points p_i and p_j in a trajectory where j > i, the sum of successive Euclidean distances is denoted by δ_{i,j} and is defined as

    \delta_{i,j} = \sum_{i \le m < j} d(p_m, p_{m+1}).

Regarding the parameters of our approach: values of λ > 1 make the standard deviation of Ψ larger than that of the trajectory, while λ < 0.5 is too restrictive on the output. In our experiments, we have chosen λ = 0.6. kw is the sliding window width for the execution of the optimization step. Figure 6 shows the results for kw = {3, 4, ..., 6} over a synthetic data set. We have chosen kw = 5 for most of our experiments. According to our observations, kw > 6 proves a bit time consuming during optimization, while kw < 3 is not sufficient to capture the local geometry. α is the parameter governing the relationship between graph and Euclidean distance. We set it to 1.5, which means that the graph distance should not be greater than 1.5 times the Euclidean distance. A reasonable range is 1 ≤ α ≤ 2. σ_KG and σ_KG' are learned through optimization over a holdout sample of 20 consecutive inputs.
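To make the preceding definitions concrete, the following Python sketch shows one way to build the spatio-temporal kernel matrices from the cumulative successive distances δ and to evaluate the objective of Eq. (7). It reflects our reading of the text, not the authors' implementation; names such as `spatio_temporal_kernel` and `kmm_objective` are ours, and the RBF-style use of δ is an assumption.

```python
import numpy as np

def successive_distance_matrix(points):
    """delta[i, j] = sum of Euclidean distances of successive pairs between i and j."""
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)   # d(p_m, p_{m+1})
    cum = np.concatenate([[0.0], np.cumsum(steps)])           # cumulative path length
    return np.abs(cum[:, None] - cum[None, :])

def spatio_temporal_kernel(points, sigma):
    """RBF-like kernel over the summed successive distances (K_G resp. K_G')."""
    delta = successive_distance_matrix(points)
    return np.exp(-delta ** 2 / (2.0 * sigma ** 2))

def kmm_objective(V, Psi, sigma_G, sigma_Gp, lam, sigma_N):
    """Difference-of-kernels term minus the regularizer of Eq. (6), cf. Eq. (7)."""
    K_G  = spatio_temporal_kernel(V,   sigma_G)
    K_Gp = spatio_temporal_kernel(Psi, sigma_Gp)
    reg  = np.exp(-((np.sum((V - Psi) ** 2, axis=1) - lam * sigma_N ** 2) ** 2)
                  / (2.0 * sigma_N ** 2))
    return np.sum((K_G - K_Gp) ** 2) - np.sum(reg)
```

In a full KMM implementation, this objective would then be minimized over Ψ (e.g., with a conjugate-gradient routine) within each sliding window of width kw.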
Algorithm 1: KMM
Input: G, G', K_G, K_G', K_GG', K_Ω, λ, kw, α, σ_N
Output: Φ(V) - the output path P
// G, G' - Euclidean graphs
// K_G, K_G', K_GG', K_Ω - graph kernels
// λ - regularizer
// kw - sliding window width
// α - constant for Euclidean versus graph distance validation
for i <- 1 : |V| - kw do
    foreach sliding window do
        compute K_G, K_G'
        Ψ <- optimize Fo w.r.t. Ψ
i <- 1, i_prev <- 1, prev_dist <- 0
while i < |V| do
    i <- i + 1
    i_prev <- max(1, i_prev)
    min_obj <- ∞
    if i_prev = i - 1 then
        prev_dist <- 0
        con_dist <- d(Ψ_i, Ψ_{i-1})
    else
        con_dist <- d(Ψ_i, Ψ_{i-1}) + prev_dist + d(Ψ_{i_prev}, Ψ_{i_prev+1})
    for c in C_{Φ_i} do
        obj_val <- -e^{-(c - Ψ_i)^2} · K_{Ω,i}
        if obj_val < min_obj then
            min_obj <- obj_val
            Φ_i <- c
    if Ω(Φ_i, Φ_{i_prev}) > α × d(Ψ_i, Ψ_{i_prev}) then
        i_prev <- i_prev - 1
        prev_dist <- con_dist
    else
        assign all points between i and i_prev to the shortest path between Φ_i and Φ_{i_prev}
Output path P by connecting consecutive Φ

4 Experimental Evaluation
In this section we report the results from a series of experiments which we conducted in order to empirically investigate the following questions:
(Q1) Can kernel methods be used for map matching?
(Q2) If so, how do they perform compared to state-of-the-art methods?
(Q3) Is kernelized embedding indeed reducing the noise?
(Q4) Does the rounding step in KMM contribute to the error reduction?
To this aim, we implemented KMM in scientific Python running on a standard Intel Quadcore 2 GHz computer. Overall, we decided on two experimental setups. Our first experimental setup evaluates and compares KMM's accuracy performance on a real-world dataset recently used in [Newson and Krumm, 2009] to evaluate a hidden Markov model based map matching approach, and hence addresses Q1 and Q2. To address Q4, we compare the performance of our rounding step with a randomized rounding step. The second setup investigates Q3 by comparing the result of KMM's embedding step to the baseline "closest point on edge" using a synthetically generated dataset.

4.1 Q1+Q2+Q4: Real-World Dataset
In our first experiment, we compared KMM's performance to the performance of Krumm and Newson's recent hidden Markov model based approach [Newson and Krumm, 2009]. To measure performance, we used the Route Mismatch Fraction measure already used by Newson and Krumm. The Route Mismatch Fraction (RMF) is the total length of false positive and false negative route divided by the length of the original route [Newson and Krumm, 2009]:

    d+ = length of erroneously added route
    d- = length of erroneously subtracted route
    d_o = length of original route
    RMF = (d+ + d-) / d_o

We used Krumm and Newson's dataset. It consists of a 50-mile route in Seattle sampled at 1 Hz, giving one trajectory of 7531 time-stamped latitude/longitude pairs along with a manually matched ground-truth path. The street network comprises around 150,000 road segments. [Newson and Krumm, 2009] presented results for different sampling intervals and noise degradations of this data. We take six base cases where the HMM model performance is good, i.e., 5 and 10 second sampling intervals vs. 30, 40, and 50 meters noise. Because we want to have a statistical comparison (average, significance, etc.) with the HMM, we perform 25 experiments for comparison with each base value in the following way. For one sampling interval, say 10, we choose 5 different starting points from the initial 5 points of the trajectory and then sampled at the given rate. Following this procedure, we prepared experimental datasets for each sample by adding 5 instances of noise for one standard deviation (e.g., 30), i.e., 25 datasets per combination = 5 samples × 5 noise instances.
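As a small illustration of the evaluation metric, the RMF defined above can be computed as in the following sketch; this is our own code with hypothetical route-length inputs, not taken from the paper.

```python
def route_mismatch_fraction(len_added, len_subtracted, len_original):
    """RMF = (d+ + d-) / d_o, cf. [Newson and Krumm, 2009]."""
    return (len_added + len_subtracted) / len_original

# Example: 1.2 km erroneously added, 0.8 km missed, on a 50 km ground-truth route
print(route_mismatch_fraction(1.2, 0.8, 50.0))  # 0.04
```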
Fig. 5 summarizes the experimental results showing the RMF errors. As one can see, in 5 out of 6 cases KMM estimated a lower route mismatch fraction. The statistical significance of the results is shown in Table 1. In 4 out of the 5 cases where we are better, the differences in mean values are significant (t-test, p = 0.05). Averaged over all experiments, a Wilcoxon test identifies KMM to be significantly better.

Figure 5: Route Mismatch Fraction of KMM vs. HMM for different noise levels (NL, in meters) and sampling rates (SR, in seconds); panels: (a) NL 30, SR 5; (b) NL 40, SR 5; (c) NL 50, SR 5; (d) NL 30, SR 10; (e) NL 40, SR 10; (f) NL 50, SR 10. We ran 150 experiments, i.e., 25 experiments per NL-SR setting. One blue dot denotes the route mismatch fraction for an experiment, blue graphs the estimated normal distributions, red lines the means, and dashed green lines the route mismatch fraction for an HMM as reported by Newson and Krumm. As one can see, in 5 out of 6 cases KMM estimates a lower route mismatch fraction. In 4 out of 5 cases, the differences in mean values are significant (t-test, p = 0.05). Averaged over all experiments, a Wilcoxon test identifies KMM to be significantly better. (Best viewed in color)

Table 1: Statistics for the comparison with the HMM.

    Statistics for sampling rate = 5
    Noise   σ_error   Student t   Wilcoxon signed rank
    30      0.029     1           0.00012
    40      0.03992   1           0.0027
    50      0.036     1           0.00122

    Statistics for sampling rate = 10
    Noise   σ_error   Student t   Wilcoxon signed rank
    30      0.027     1           0.0004
    40      0.034     0           0.047
    50      0.054     1           0.00941

To address Q4, we compared the performance of our rounding step with a randomized rounding step, i.e., random assignment of the embedding output Ψ to a street network point in an ε-ball, over the real-world dataset described above. Table 2 summarizes the results. As one can see, our rounding step always performs better, and in most cases outperforms the randomized rounding step by a margin of 4-10 percent in error. To summarize, the results clearly answer questions Q1, Q2 and Q4 affirmatively.

Table 2: Comparison of average KMM error with average Randomized Rounding (ERR) error over different noise ratios and sampling rates (SR).

    Noise Ratio   KMM Error   ERR SR=5   ERR SR=10
    30            0.1463      0.1783     0.1480
    40            0.3022      0.3221     0.2350
    50            0.3739      0.4671     0.3979

4.2 Q3: Synthetic Datasets
In order to investigate how well the embedding step reduces noise, we generate ground truth points with the help of synthetic data. To generate the data, we used Thomas Brinkhoff's data generator [Brinkhoff, 2000]. It allows one to generate trajectories according to some underlying road network for different speed and sampling-time variation settings. Additionally, we assumed normal noise on the generated observations. It is well known that navigation systems produce noise that is normally distributed, with an average noise level (σ_N) ranging from 10 to 100 meters. For instance, Krumm and Newson [Newson and Krumm, 2009] discussed the different types and amounts of noise in real-world movement data and have shown that they can be described well by a Gaussian shape.

Specifically, we generated 100 trajectories of average length 50 from the street network of Oldenburg, Germany. Then, we added noise σ_N for σ_N = 20, 30, ..., 100. This resulted in an overall set of 900 trajectories. As a baseline for comparison we used a "project to the closest point in the street network" approach. To get a fair comparison, we also used "project to the closest point in the street network" as the "rounding" method for KMM. We report on the root mean squared difference in meters achieved by the two methods. The results are summarized in Fig. 6. As one can see, in all cases the embedding indeed reduced the noise considerably. Moreover, it performs better with increasing noise levels, as the regression lines have consistently a smaller slope than the "equilibrium" line. The turning point is around noise level 25. This clearly answers question Q3 affirmatively. To summarize, the results of our experiments indicate that kernel methods can indeed be used for map matching and achieve results comparable to state-of-the-art techniques.

Figure 6: Scatter plots of the root mean squared (RMS) error of KMM and of the baseline, closest point on edge, for different window sizes (panels (a)-(d): K = 3, 4, 5, 6) on the Oldenburg dataset for different noise levels (20, 30, 40, ..., 100) as denoted by the numbers associated with the dots. For better comparison with the equilibrium (solid straight line), we also show the linear regression line of the values. As one can see, KMM performs much better with increasing noise levels. The regression lines have consistently a smaller slope than the "equilibrium" line. The turning point is around noise level 25. (Best viewed in color)

5 Conclusion
In this paper, we have recognized that kernel methods have an important role to play in map matching. Specifically, we have established — to the best of our knowledge — the first link between map matching and kernel methods and instantiated the link by developing an easy-to-implement kernelized map matching (KMM) approach. By accounting for spatial and temporal correlations among the trajectories using kernels, map matching becomes less sensitive to noisy observations. This allowed us to employ a simple rounding scheme to compute the hard assignments of trajectory points to points lying on the network. Our experimental results on both simulated as well as real-world datasets show that kernel methods can indeed be used for map matching and can achieve better performance than state-of-the-art hidden Markov models.

The link established between map matching and kernels provides many interesting avenues for future work; we have only started to explore it. Indeed, one should study alternative kernels and map features, cost functions, and hyper-parameter optimization criteria. For instance, so-called Fisher kernels are kernels derived from hidden Markov models.
In turn, we may utilize any HMM based map matching approach. Testing KMM within a real-world system tracking system is another interesting avenue. Prov- ing the hardness of map matching problems along with guarantees on approximation are interesting venues of fu- ture work. Finally, KMM directly generalize to the case of 3D trajectories. Acknowledgements The authors would like to thank Paul Newson and John Krumm for making their data available, to anonymous re- viewers for helpful suggestions and Novi Quadrianto for taming the notation and ideas in this paper . This work was supported by the Fraunhofer ATTRACT fellowship STREAM and a ”Higher Education Commission” (HEC)- Pakistan scholarship. References [Brakatsoulas et al., 2005] Sotiris Brakatsoulas, Dieter Pfoser, Randall Salas, and Carola Wenk. On map- matching vehicle tracking data. In VLDB, pages 853– 864. VLDB Endowment, 2005. [Brinkhoff, 2000] Thomas Brinkhoff. Generating network-based moving objects. In SSDBM, page 253, Washington, DC, USA, 2000. IEEE Computer Society. [Cox and Cox, 2000] Trevor F. Cox and M.A.A. Cox. Multidimensional Scaling, Second Edition. Chapman and Hall/CRC, 2 edition, 2000. [Fox et al., 2003] Dieter Fox, Jeffrey Hightower, Lin Liao, Dirk Schulz, and Gaetano Borriello. Bayesian filter- ing for location estimation. IEEE Pervasive Computing, 2(3):24–33, 2003. [Guo et al., 2008] Y. Guo, J. Gao, and P.W. Kwan. Twin kernel embedding. Pattern Analysis and Machine Intel- ligence, IEEE Transactions on, 30(8):1490–1495, 2008. [Lee and Verleysen, 2007] John A. Lee and Michel. Ver- leysen. Nonlinear dimensionality reduction. Springer, New York; London, 2007. [Lou et al., 2009] Yin Lou, Chengyang Zhang, Yu Zheng, Xing Xie, Wei Wang, and Yan Huang. Map-matching for low-sampling-rate gps trajectories. In GIS, pages 352–361, New York, NY, USA, 2009. ACM. [Mitchell, 2009] Tom Mitchell. Mining our reality. Sci- ence, 326(5960):1644–1645, 2009. [Newson and Krumm, 2009] Paul Newson and John Krumm. Hidden markov map matching through noise and sparseness. In GIS, pages 336–343, New York, NY, USA, 2009. ACM. [Pemmaraju and Skiena, 2003] Sriram Pemmaraju and Steven Skiena. Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica . Cambridge University Press, 2003. [Quadrianto et al., 2009] Novi Quadrianto, Le Song, and Alex J. Smola. Kernelized sorring. In NIPS 21, pages 1289–1296. 2009. [Quddus et al., 2007] Mohammed A. Quddus, Washing- ton Y. Ochieng, and Robert B. Noland. Current map- matching algorithms for transport applications: State- of-the art and future research directions. Transportation Research Part C: Emerging Technologies, 15(5):312 – 328, 2007. [Shawe-Taylor and Cristianini, 2004] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Anal- ysis. Cambridge University Press, illustrated edition edi- tion, 2004. [Shewchuk, 1994] Jonathan R Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, 1994. [Taylor et al., 2001] George Taylor, Geoffrey Blewitt, Do- erte Steup, Simon Corbett, and Adrijana Car. Road re- duction filtering for gps-gis navigation. Transactions in GIS, 5(3):193–207, 2001. [Weston and Vapnik, 2002] Chapelle O. Elisseeff A. Scholkopf B. Weston, J. and V. Vapnik. Kernel de- pendency estimation. Advances in neural information processing systems, 2002. 
Convex NMF on Non-Convex Massiv Data

Kristian Kersting(1), Mirwaes Wahabzada(1), Christian Thurau(2), and Christian Bauckhage(2)
(1) Knowledge Discovery Department, (2) Vision and Social Media Group
Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
firstname.lastname@iais.fraunhofer.de

Abstract
We present an extension of convex-hull non-negative matrix factorization (CH-NMF), which was recently proposed as a large-scale variant of convex non-negative matrix factorization (C-NMF) or Archetypal Analysis (AA). CH-NMF factorizes a non-negative data matrix V into two non-negative matrix factors V ≈ WH such that the columns of W are convex combinations of certain data points, so that they are readily interpretable to data analysts. There is, however, no free lunch: imposing convexity constraints on W typically prevents adaptation to intrinsic, low dimensional structures in the data. Alas, in cases where the data is distributed in a non-convex manner or consists of mixtures of lower dimensional convex distributions, the cluster representatives obtained from CH-NMF will be less meaningful. In this paper, we present a hierarchical CH-NMF that automatically adapts to internal structures of a dataset; hence it yields meaningful and interpretable clusters for non-convex datasets. This is also confirmed by our extensive evaluation on DBLP publication records of 760,000 authors, 4,000,000 images harvested from the web, and 150,000,000 votes on World of Warcraft guilds.

1 Introduction
Modern applications of data mining and machine learning in computer vision, natural language processing, computational biology, and other areas often consider massive datasets, and we need to run expensive algorithms such as principal component analysis (PCA), latent Dirichlet allocation (LDA), or non-negative matrix factorization (NMF) to extract meaningful, low-dimensional representations. If the data are words contained in documents, these methods yield topic models representing each document as a mixture of a small number of topics, where each word is attributable to one of the topics. In computer vision, where it is common to represent images as vectors in a high-dimensional space, they extract visual words and have been used for face and object recognition, or color segmentation. Social networks such as Flickr, Facebook, or Myspace allow for a wide range of interactions amongst their members, resulting in massive, temporal datasets relating users, media objects, and actions. Here, low-dimensional representations may identify and summarize common social activities.

Therefore, given massive matrices of hundreds of millions of entries, how can we efficiently factorize them? How can we create meaningful, low-dimensional representations? How can we gain insight into the dataset? These are precisely the questions which we address in this paper. A recent positive development in data mining and machine learning has been the realization that massive datasets are not only challenging but may as well be viewed as an opportunity [Torralba et al., 2008; Talwalkar et al., 2008]. Machine learning and data mining techniques typically consist of two parts: the model and the data. Most effort in recent years has gone into the modeling part. Massive datasets, however, allow one to move in the opposite direction: how much can the data itself help us to solve the problem? Halevy et al. [2009] even speak of "the unreasonable effectiveness of data".
Massive datasets are likely to capture even very rare aspects of the problem at hand. Along this line, Thurau et al. [2009] have recently introduced a data-driven NMF approach, called convex-hull NMF, that is fast and scales extremely well: it can efficiently factorize gigantic matrices and in turn extract meaningful "clusters" from massive datasets containing millions of images and ratings. The key idea is to restrict the "clusters" to be combinations of vertices of the convex hull of the dataset, thus directly exploring the data itself to solve the convex NMF problem.

There is, however, no free lunch: by restricting the "clusters" to combinations of vertices of the convex hull of the dataset, convex-hull NMF cannot adapt to the intrinsic, low dimensional structure of the data anymore. Intuitively, for data modeled by Gaussians, i.e., as a combination of convex sets, convex-hull NMF will assign clusters to the "extreme" Gaussians and not to each Gaussian. Our main contribution is a simple and, hence, powerful and scalable generalization of convex-hull NMF that automatically adapts to the intrinsic (low dimensional) structure in the data. The main insight is that one can use FastMap due to Faloutsos and Lin [1995] within convex-hull NMF to compute the convex hull vertices on massive datasets. Consequently, we can still solve the convex NMF problem but additionally adapt to the intrinsic structure in the data, when we split the data along each 1D FastMap line recursively, stop when some minimum node size is reached, and apply a post-pruning step. This way, we get meaningful clusters that respect the structure of the data, i.e., that diversify the results even better than convex-hull NMF. Our extensive experimental evaluation shows that the method, called hierarchical convex-hull NMF, achieves similar reconstruction quality as convex-hull NMF with only a small overhead, while it produces more diversified clusters on DBLP publication records of 760,000 authors, 4 million tiny images, and 150 million votes on World of Warcraft guilds.

Figure 1: The basis vectors resulting from different NMF variants applied to the CBCL Face Database 1. (a) Standard NMF results in part-based, sparse representations. Data points cannot be expressed as convex combinations of these basis elements. (b) Convex NMF (C-NMF) yields basis elements that allow for convex combinations. Moreover, the basis vectors are "meaningful" since they closely resemble given data points. They are, however, not indicative of characteristic variations among individual samples. (c) Convex-Hull NMF (CH-NMF) diversifies the basis vectors, resulting in pale faces, faces with glasses, faces with beards, and so on. (d) Hierarchical CH-NMF (HCH-NMF) as proposed in the current paper also diversifies the results. Additionally, it automatically groups them, i.e., it identifies structure within the data. The induced hierarchical decomposition is shown together with the number of images falling into the corresponding subtrees.

We proceed as follows. We start off by briefly reviewing non-negative matrix factorization (NMF) in Section 2, including convex NMF and convex-hull NMF. Then, we introduce hierarchical convex-hull NMF in Section 3. Before concluding, we present our extensive experimental evaluation in Section 4.
2 Non-Negative Matrix Factorization
Assume an m × n input data matrix V = (v1, ..., vn) consisting of n column vectors of dimensionality m. We consider factorizations of the form V ≈ W_{m×k} H_{k×n}. The resulting matrix W contains a set of k ≪ n basis vectors which are linearly combined using the coefficients in H to represent the data. Common approaches to achieve such a factorization include Principal Component Analysis (PCA) [Jolliffe, 1986], Singular Value Decomposition (SVD) [Golub and van Loan, 1996], Vector Quantization (VQ), or non-negative matrix factorization (NMF) [Lee and Seung, 1999].

Various variants and improvements to NMF have been introduced in recent years. For example, Cai et al. [2008] presented a matrix factorization that obeys the geometric data structure. In [Kim and Park, 2008], a speed improvement to NMF is achieved using a novel algorithm based on an alternating nonnegative least squares framework. Another interesting variation is presented in [Suvrit, 2008], where optimization is based on a block-iterative acceleration technique. Recently, Mairal et al. [2010] have presented a very elegant online NMF approach based on sparse coding that also scales to large matrices (for additional related work, please see the references in [Mairal et al., 2010]). In this work, however, we build on Convex-NMF (C-NMF), recently introduced by Ding et al. [2009], and it is not clear how to adapt these advanced NMF techniques to it, as Convex-NMF represents the data matrix V as a convex combination of data points, i.e., V = VGH^T, where each column i of G is a stochastic vector that obeys ||g_i||_1 = 1, g_i ≥ 0. This is akin to Archetypal Analysis according to Cutler and Breiman [1994], where both matrices G and H^T are to be stochastic. Convex-NMF yields interesting interpretations of the data because each data point is now expressed as a weighted sum of certain data points (see Fig. 1).

Convex NMF
Convex non-negative matrix factorization (C-NMF) was introduced by Ding et al. [2009] and minimizes J = ||V - VGH^T||^2, where V in R^{m×n}, G in R^{n×k}, H in R^{n×k}. The matrices G and H are updated iteratively until convergence using the following update rules

    G_{ik} = G_{ik}\,\sqrt{\frac{(Y^+ H)_{ik} + (Y^- G H^T H)_{ik}}{(Y^- H)_{ik} + (Y^+ G H^T H)_{ik}}}    (1)

and

    H_{ik} = H_{ik}\,\sqrt{\frac{(Y^+ G)_{ik} + (H G^T Y^- G)_{ik}}{(Y^- G)_{ik} + (H G^T Y^+ G)_{ik}}}    (2)

where Y = V^T V, and the matrices Y^+ and Y^- are given by Y^+_{ik} = (|Y_{ik}| + Y_{ik})/2 and Y^-_{ik} = (|Y_{ik}| - Y_{ik})/2, respectively.

For the initialization of G and H, two methods are proposed. The first initializes to (almost) unary representations based on a k-means clustering of V. The second assumes a given NMF or Semi-NMF solution. For further details on the algorithm and its initializations we refer to [Ding et al., 2009]. Recall, however, that our goal is to analyse massive, high-dimensional datasets. Unfortunately, the C-NMF update rules (1) and (2) have a time complexity of O(n^2). Moreover, although the iterative algorithm comes down to simple matrix multiplications, the size of the involved matrices quickly becomes another limiting factor (similar to the intermediate blowup problem in tensor decomposition [Kolda and Sun, 2008]), since V^T V results in an n × n matrix. Switching to an online update rule would avoid memory issues, but it would at the same time introduce additional computational overhead. Overall, we can say that C-NMF does not scale to large datasets. In the following, we will review Convex-Hull NMF (CH-NMF), which is a recent C-NMF method that is well suited for large-scale data analysis.
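A compact NumPy sketch of these multiplicative updates is given below. It follows our reading of Eqs. (1) and (2) and is meant as an illustration, not as the reference implementation of [Ding et al., 2009]; the random initialization and fixed iteration count are simplifications.

```python
import numpy as np

def cnmf(V, k, n_iter=200, eps=1e-9):
    """Convex NMF, V ~ V G H^T, via the multiplicative updates of Eqs. (1)-(2)."""
    n = V.shape[1]
    rng = np.random.default_rng(0)
    G = rng.random((n, k))
    H = rng.random((n, k))

    Y = V.T @ V
    Yp = 0.5 * (np.abs(Y) + Y)   # Y+
    Ym = 0.5 * (np.abs(Y) - Y)   # Y-

    for _ in range(n_iter):
        GHH = G @ (H.T @ H)                       # G H^T H
        G *= np.sqrt((Yp @ H + Ym @ GHH + eps) / (Ym @ H + Yp @ GHH + eps))
        HGYpG = H @ (G.T @ Yp @ G)                # H G^T Y+ G
        HGYmG = H @ (G.T @ Ym @ G)                # H G^T Y- G
        H *= np.sqrt((Yp @ G + HGYmG + eps) / (Ym @ G + HGYpG + eps))
    return G, H
```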
Convex-Hull NMF
Convex-Hull NMF aims at a data factorization based on the data points residing on the data convex hull. Such a data reconstruction has two interesting properties: first, the basis vectors are real data points and mark, unlike in most other clustering/factorization techniques, the most extreme and not the most common data points. Second, any data point can be expressed as a convex and meaningful combination of these basis vectors. This offers interesting new opportunities for data interpretation, as indicated in Fig. 1 and demonstrated in [Thurau et al., 2009].

Algorithm 1: CH-NMF
Input: Data matrix V_{m×n}
Output: Matrices X and H
1 Compute k eigenvectors e_l, l = 1 ... k, of the covariance matrix of V_{m×n};
2 Project V onto the 2D subspaces E_{o,q} = V^T [e_o, e_q], o = 1 ... k, q = 1 ... k, o ≠ q;
3 Compute and mark convex hull data points H_cvx(E_{o,q}) for each 2D projection;
4 Combine the marked convex hull data points (using the original data dimensionality m): S_{m×p} = {H_cvx(E_{1,2}), ..., H_cvx(E_{k-1,k})};
5 Optimize J_S = ||S - S I_{p×k} J_{k×p}||^2 such that ||i_i||_1 = 1, i_i ≥ 0 and ||j_i||_1 = 1, j_i ≥ 0;
6 Optimize J = ||v_i - X h_i^T||^2 for i = 1 ... n, where X = S_{m×p} I_{p×k}, such that ||h_i||_1 = 1 and h_i ≥ 0;

More precisely, following Ding et al. [2009], one seeks a factorization of the form V = VGH^T, where V in R^{m×n}, G in R^{n×k}, H in R^{n×k}. One further restricts the columns of G and H to convexity, i.e., ||g_i||_1 = 1, g_i ≥ 0 and ||h_j||_1 = 1, h_j ≥ 0. Indeed, Ding et al. [2009] also consider convex combinations, but not for the matrix H. In other words, CH-NMF aims at factorizing the data such that each data point is expressed as a convex combination of convex combinations of specific data points. The task now is to minimize

    J = \|V - VGH^T\|^2    (3)

such that ||g_i||_1 = 1, g_i ≥ 0 and ||h_j||_1 = 1, h_j ≥ 0. To do so, one sets X = V_{m×n} G_{n×k}. The intuition is as follows. Since we assume a convex combination for X, and by definition of the convex hull, the convex hull H_cvx(V) of V must contain X. Obviously, we could achieve a perfect reconstruction, giving J = 0, by setting G so that it would contain exactly one entry equal to 1 for each convex hull data point while all other entries were set to zero. Or, more informally: following the definition of the convex hull, we can perfectly reconstruct any data point by a convex combination of convex hull data points. Therefore, our goal becomes to solve Eq. (3) by finding k appropriate data points on the convex hull:

    J = \|V - XH^T\|^2 \quad \text{s.t. } x_i \in H_{cvx}(V),\ i = 1, \dots, k    (4)

Finding a solution to Eq. (4), however, is not necessarily straightforward. It is known that the worst-case complexity for computing the convex hull of n points in m dimensions is Θ(n^⌊m/2⌋). Moreover, the number of convex hull data points may tend to n for high dimensional spaces, see e.g. [Donoho and Tanner, 2005; Hall et al., 2005], so that computing the convex hull of large datasets quickly becomes practically infeasible. CH-NMF therefore seeks an approximate solution by subsampling the convex hull. It exploits the fact that any data point on the convex hull of a linear lower dimensional projection of the data also resides on the convex hull in the original data dimension. Since V contains finitely many points and therefore forms a polytope in R^m, we can resort to the main theorem of polytope theory, see e.g. [Ziegler, 1995]. In our context, it says that every vertex of an affine image of P, i.e., every point of the convex hull of the image of P, corresponds to a vertex of P. Therefore, computing the convex hull of several 2D affine projections of the data offers a way of subsampling H_cvx(V). This is an efficient way, as computing the convex hull of a set of 2D points can be done in O(n log n) time [de Berg et al., 2000].

This subsampling strategy is the main idea underlying CH-NMF and, indeed, various methods can be used for linearly projecting the data to a 2D space. Thurau et al. [2009] proposed to use PCA, i.e., projecting the data using pairwise combinations of the first d eigenvectors of the covariance matrix of V, as summarized in Alg. 1. For massive, high-dimensional data, however, computing the covariance matrix may take a lot of time. Therefore, we propose to use instead Faloutsos and Lin's FastMap [1995], but please see the next section. The main point here is that, triggered by the idea that the expected size of the convex hull of n Gaussian data points in the plane is Ω(√log n) [Hueter, 1999], CH-NMF extracts only approximately p = j√(log n) candidate points. This candidate set grows much slower than n. Given a candidate set of p convex hull data points S ⊆ H_cvx(V), we now select those k convex hull data points that yield the best reconstruction of the remaining subset S. This, again, can be formulated as a convex NMF optimization problem.
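The candidate-subsampling idea of steps 1-4 of Alg. 1 can be sketched with NumPy and SciPy's 2D convex hull routine as follows. We use PCA-based projections as in the algorithm; the function name is ours, and the sketch ignores the Ω(√log n)-style considerations of the full method.

```python
import numpy as np
from itertools import combinations
from scipy.spatial import ConvexHull

def convex_hull_candidates(V, k):
    """Steps 1-4 of Alg. 1: collect convex-hull points of pairwise 2D eigen-projections.

    V is m x n (columns are data points); returns the candidate matrix S (m x p).
    """
    cov = np.cov(V)                                        # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    E = eigvecs[:, np.argsort(eigvals)[::-1][:k]]          # k leading eigenvectors (m x k)

    candidate_idx = set()
    for o, q in combinations(range(k), 2):
        proj = V.T @ E[:, [o, q]]                          # n x 2 projection E_{o,q}
        hull = ConvexHull(proj)
        candidate_idx.update(hull.vertices.tolist())       # indices of 2D hull vertices
    return V[:, sorted(candidate_idx)]                     # candidates in the original dimension m
```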
Therefore computing the convex hull of several 2D affine projections of the data offers a way of subsampling Hcvx(V). This is an efficient way as computing the con- vex hull of a set of 2D points can be done in O(n log n) time, [de Berg et al., 2000]. This subsampling strategy is the main idea underlying CH-NMF and, indeed, various methods can be used for lin- early projecting the data to a 2D space. Thurau et al. [2009] proposed to use PCA, i.e., projecting the data using pair- wise combinations of the first d eigenvectors of the covari- ance matrix of V as summarized in Alg. 1. For massive, high-dimensional data, however, computing the covariance matrix may take a lot of time. Therefore, we propose to use instead Faloutsos and Lin’s FastMap [1995], but please see next section. The main point here is, triggered by the idea that the ex- pected size of the convex hull of n Gaussian data points in the plane is Ω( √ log n) [Hueter, 1999], CH-NMF extracts only approximately p = j √ log n candidate points. This candidate set grows much slower than n. Given a candi- date set of p convex hull data points S ∈ Hcvx(V), we now select those k convex hull data points that yield the best reconstruction of the remaining subset S. This, again, can be formulated as a convex NMF optimization problem. We now have to minimize the following reconstruction er- ror JS = ‖Sm×p − Sm×pIp×kJk×p‖2 (5) under the convexity constraints ‖Ii‖1 = 1, Ii ≥ 0 and ‖Ji‖1 = 1,Ji ≥ 0 . Since p n, solving (5) can be done efficiently using a quadratic programming solver. Note that the data dimensionality is m. The convex hull projection only served to determine a candidate set; all further com- putations are carried out in the original data space. By obtaining a sufficient reconstruction accuracy for S, we can set X = Sm×pIp×k and thereby select k convex hull data points for solving Eq. (4). Typically, I results in unary representations. If this is not the case, we simply map SI to their nearest neighboring data point in S. Given X, the computation of the coefficients H is straight forward. For smaller data-sets it is possible to use the iterative update rule from Eq. (2). However, since we do not further modify the basis vectors X, we can also find an optimal solution for each data point vi individu- ally Ji = ‖vi −Xhi‖2 using common solvers. Obviously, this can be parallelized. 3 Hierarchical CH-NMF Modern datasets are not only massive, but often very com- plex. This makes it challenging to find useful information in the data. Fortunately, most datasets typically have a low intrinsic dimension. That is, the data lay on a smooth, structured low-dimensional manifold. As an illustrative (already low-dimensional) example consider Fig. 2. It de- picts a typical ”non-convex data” situation. We have drawn data points from 10 randomly positioned Gaussians in 2D and were interested in computing overcomplete represen- tations, i.e., the number of basis vectors is greater than the dimensionality of the input. Overcomplete represen- tations have been advocated because they have greater ro- bustness in the presence of noise, can be sparser, and can have greater flexibility in matching structure in the data. By design, CH-NMF assigns clusters to the ”extreme” Gaus- sians and not to ”inner” Gaussians, cf. Figs. 2 (a,c). Al- though we can still reconstruct each data point perfectly, this is discouraging. 
3 Hierarchical CH-NMF
Modern datasets are not only massive, but often very complex. This makes it challenging to find useful information in the data. Fortunately, most datasets typically have a low intrinsic dimension. That is, the data lie on a smooth, structured low-dimensional manifold. As an illustrative (already low-dimensional) example, consider Fig. 2. It depicts a typical "non-convex data" situation. We have drawn data points from 10 randomly positioned Gaussians in 2D and were interested in computing overcomplete representations, i.e., the number of basis vectors is greater than the dimensionality of the input. Overcomplete representations have been advocated because they have greater robustness in the presence of noise, can be sparser, and can have greater flexibility in matching structure in the data. By design, CH-NMF assigns clusters to the "extreme" Gaussians and not to the "inner" Gaussians, cf. Figs. 2 (a,c). Although we can still reconstruct each data point perfectly, this is discouraging. The intrinsic (low dimensional) structure of the data is not captured and, in turn, the representation of the data found is not as meaningful as it could be.

Figure 2: Resulting basis vectors of CH-NMF (a,c) for 4 resp. 8 basis vectors (bv) and of HCH-NMF (b,d,e) for 4, 8, resp. 16 basis vectors. The data samples are drawn from 10 randomly placed Gaussian distributions in 2D, 500 samples per Gaussian. For 4 basis vectors (b), HCH-NMF essentially mimics CH-NMF. For more basis vectors (d,e), however, it starts to adapt to the structure of the data: the basis vectors reside on the convex hulls of the Gaussians. CH-NMF's basis vectors (a,c), in contrast, remain residing on the convex hull. By design, it considers the "extreme" Gaussians only and does not adapt to the structure of the data. (Best viewed in color.)

Algorithm 2: HCH-NMF: Hierarchical Convex-Hull NMF
Input: Data, i.e., the set of rows v_i of V_{m×n}; the pairwise distances D; the minimal size MinSize of a leaf
Output: A hierarchical decomposition of D represented as a tree T
1 if |V| < MinSize then
2     return CH-NMF(Leaf) for l (even multiple of k) basis vectors
3 else
4     Rule <- CHOOSERULE(V);
5     V_s <- {v_i in V | Rule(v_i) = true};
6     V_f <- {v_i in V | Rule(v_i) = false};
7     LeftTree <- MAKETREE(V_s);
8     RightTree <- MAKETREE(V_f);
9     return (Rule, LeftTree, RightTree)

Hierarchical convex-hull NMF (HCH-NMF) is a convex NMF approach that automatically adapts to the low intrinsic dimensionality of data, as illustrated in Figs. 2 (b,d,e). The elegance of HCH-NMF stems from two facts:
• It naturally falls out of running CH-NMF using FastMap [Faloutsos and Lin, 1995] for efficiently computing and marking convex hull data points.
• In turn, it provably solves the convex NMF problem as it directly makes CH-NMF manifold-adaptive.
The latter point is difficult to prove for a two-step approach: run any clustering approach, then run CH-NMF on the clusters. Also, employing any of the existing large-scale manifold learning methods, see e.g. [Talwalkar et al., 2008], is difficult. As Talwalkar et al. [2008] argue, they require an O(n^3) spectral decomposition of matrices, where n is the number of samples. When the matrix is sparse, these techniques can be implemented relatively efficiently. However, when dealing with a large, dense matrix, the involved matrix products become expensive to compute.

As summarized in Alg. 2, HCH-NMF is based on a hierarchical decomposition of R^D in the form of a tree (in this paper, we follow Dasgupta and Freund's random projection trees (RPTs) [Dasgupta and Freund, 2009a]; we do not split along a random direction but note that this could easily be achieved). That is, it starts with the empty tree and repeatedly searches for the best test for a node according to some splitting criterion such as weighted variance along the FastMap dimension. Next, the examples V in the node are split into V_s (success) and V_f (failure) according to the test. For each split, the procedure is recursively applied, obtaining subtrees for the respective splits. We stop splitting if a minimum number MinSize of examples is reached or the variance in one node is small enough. In the leaves, we run CH-NMF on the examples falling into the leaves to find l basis vectors. Finally, we may run a post-processing step to find the best k basis vectors.
In other words, HCH-NMF is conceptually easy, yet scalable to massive datasets and powerful, as our experimental results will demonstrate.

Let us briefly review FastMap. FastMap computes a u-dimensional Euclidean embedding and proceeds as follows. Given pairwise distances among objects, in our case the rows of the data matrix V, we select a pair of distant objects called pivot objects. Then, we draw a line between the pivot objects. Essentially, it serves as the first coordinate axis. For each object o, we determine the coordinate value fm(o) along this axis by projecting o onto this line. Next, the pairwise distances of all objects are updated to reflect this projection, i.e., we compute the pairwise distances among the objects in the subspace orthogonal to the line. This process is repeated until, after u iterations, we get the u coordinates as well as the u-dimensional representation of all objects. Ostrouchov and Samatova [2005] have shown that the pivots are taken from the faces, usually vertices, of the convex hull of the data points in the original implicit Euclidean space. This justifies the idea to employ FastMap in step (3), "compute and mark convex hull data points", of CH-NMF as already mentioned in the last section. More importantly for HCH-NMF, it suggests a natural splitting criterion to produce a hierarchical decomposition: split the data according to the weighted variance along the 1D FastMap line. More precisely, we compute the splitting rule as summarized in Alg. 3. Essentially, we run one iteration of FastMap and split along the "FastMap" line w.r.t. weighted variance. That is, we pick a pair of distant objects x and y (lines 1-3). Then, we project the data onto the line and compute for each data point its 1D coordinate value (lines 4-6). Now, we compute the split variable θ that minimizes the weighted variance (lines 8-12) and return the corresponding splitting rule (lines 13-14). Thus, for each splitting variable, determining the split point s can be done very quickly. In turn, by scanning all of the inputs, determining the best split is feasible and scales as O(n log n).

How large should we grow the tree? Clearly, a very large tree might produce too many basis vectors. The basis vectors are likely to be not meaningful and the tree essentially "overfits" the data. A small tree, however, might not capture the important structure in the data at all.
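The following Python sketch illustrates this FastMap-style splitting rule and the recursive tree construction around it, under our reading of Alg. 2 and Alg. 3; it is not the authors' code, and details such as the stopping criteria, degenerate splits, and the leaf-level CH-NMF call (stubbed out here) are simplified.

```python
import numpy as np

def fastmap_line(points):
    """Pick two distant pivots (cf. lines 1-3 of Alg. 3) and return 1D coordinates."""
    t = points[0]
    x = points[np.argmax(np.linalg.norm(points - t, axis=1))]
    y = points[np.argmax(np.linalg.norm(points - x, axis=1))]
    direction = (y - x) / (np.linalg.norm(y - x) + 1e-12)
    return points @ direction                      # fm(v_i) for every point (up to a constant offset)

def choose_rule(points):
    """Split threshold theta minimizing the weighted variance along the FastMap line."""
    fm = np.sort(fastmap_line(points))
    best_cost, theta = np.inf, fm[0]
    for i in range(1, len(fm)):                    # candidate split after position i
        left, right = fm[:i], fm[i:]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if cost < best_cost:
            best_cost, theta = cost, (fm[i - 1] + fm[i]) / 2.0
    return theta

def make_tree(points, min_size=200):
    """Recursive decomposition in the spirit of Alg. 2; leaves would run CH-NMF."""
    if len(points) < min_size:
        return {"leaf": points}                    # placeholder for CH-NMF on the leaf
    theta = choose_rule(points)
    mask = fastmap_line(points) <= theta
    return {"theta": theta,
            "left": make_tree(points[mask], min_size),
            "right": make_tree(points[~mask], min_size)}
```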
We grow a large tree T0, stop the splitting process only when some minimum leaf size, say 200, is reached. Then, this tree is pruned in a post- processing step. We omit a detailed algorithmic description but rather briefly describe it now. The goal is to compute k ≥ l many basis vectors that reconstruct the data the best. We achieve this by successively merging neighboring leafs until we get k basis vectors. In each step, we find the two neighboring leafs that — if merged — produce the lowest reconstruction error over the covered examples. As this can be time consuming, we efficiently approximate it by always selecting the two neighboring leaves with highest resulting cluster accuracy when merging them. In conclusion, HCH-NMF with a tree of depth 0 essen- tially coincides with CH-NMF. Moreover, the well known fact that if Z is any convex set that contains a convex hull conv(U) of a set U , in particular Z = conv(U ∪ V ), then conv(U) ⊆ Z, see e.g. [Boyd and Vandenberghe, 2004], essentially proves the following correctness theorem. Theorem 1 HCH-NMF solves the convex NMF problem. It produces convex combinations of the data points in terms of basis vectors that minimize (3) but reside on convex hulls of clusters of data points. Due to the hierarchical decomposition of the data, HCH- NMF can adapt to the structure underlying the data as illus- trated in Figs. 2 (b,d,e). Indeed, it is akin to k-means. Be- cause k-means clustering is a NP-hard optimization prob- lem, see e.g. [Dasgupta and Freund, 2009b], this suggests that it is very unlikely that there exists an efficient algo- rithm for it. 4 Experiments Our intention here is to investigate whether HCH-NMF can indeed find meaningful basis vectors in massive datasets and how it compares to CH-NMF. To this aim, we im- plemented both in Python using FastMap and the h5py HDF5 interface to deal with massive data. For optimization we used the cvxopt library by Dahl and Vandenberghe2. (H)CH-NMF offers many opportunities for parallelization. 2http://abel.ee.ucla.edu/cvxopt/ 0 1000 2000 3000 4000 5000 number of data points 10-5 10-4 10-3 10-2 10-1 100 101 av er ag e re co ns tr uc tio n er ro r i n lo g sp ac e Tree Convex Hull NMF Convex Hull NMF 0 1000 2000 3000 4000 5000 number of data points 10-2 10-1 100 101 102 co m pu ta tio n tim e in s ec on ds in lo g sp ac e Tree Convex Hull NMF Convex Hull NMF Figure 3: Boxplots of reconstruction errors (top) and com- putation times [sec.] (bottom) (both in log-space) of HCH- NMF and CH-NMF for varying numbers of synthetically generated data averaged over five reruns. As mentioned before, the Frobenius norm for CH-NMF has to be lower as CH-NMF approximates the convex hull of the complete data distribution and ignores the intrisic structure. Thus, for the purpose of interpretation and diversity this is probably not the right metric. (Best viewed in color.) In the experiments, we only distributed the final recon- structions equally among all available cores. All experi- ments were ran on a standard Intel 3GHz computer with two cores. We report running times only for comparison of CH-NMF and HCH-NMF. Clearly, a C/C++ implementa- tion would run several orders of magnitude faster. We conducted four different experiments. To compare running time and reconstruction performance in a con- trolled setup, we compared (H)CH-NMF on synthetically generated data. 
In conclusion, HCH-NMF with a tree of depth 0 essentially coincides with CH-NMF. Moreover, the well-known fact that any convex set Z containing a set U also contains its convex hull conv(U), in particular conv(U) ⊆ conv(U ∪ V), see e.g. [Boyd and Vandenberghe, 2004], essentially proves the following correctness theorem.

Theorem 1 HCH-NMF solves the convex NMF problem. It produces convex combinations of the data points in terms of basis vectors that minimize (3) but reside on convex hulls of clusters of data points.

Due to the hierarchical decomposition of the data, HCH-NMF can adapt to the structure underlying the data, as illustrated in Figs. 2 (b,d,e). Indeed, it is akin to k-means. Because k-means clustering is an NP-hard optimization problem, see e.g. [Dasgupta and Freund, 2009b], this suggests that it is very unlikely that there exists an efficient algorithm for it.

4 Experiments

Our intention here is to investigate whether HCH-NMF can indeed find meaningful basis vectors in massive datasets and how it compares to CH-NMF. To this end, we implemented both in Python using FastMap and the h5py HDF5 interface to deal with massive data. For optimization we used the cvxopt library by Dahl and Vandenberghe2. (H)CH-NMF offers many opportunities for parallelization. In the experiments, we only distributed the final reconstructions equally among all available cores. All experiments were run on a standard Intel 3GHz computer with two cores. We report running times only for the comparison of CH-NMF and HCH-NMF. Clearly, a C/C++ implementation would run several orders of magnitude faster.

2 http://abel.ee.ucla.edu/cvxopt/

Figure 3: Boxplots of reconstruction errors (top) and computation times [sec.] (bottom), both in log-space, of HCH-NMF and CH-NMF for varying numbers of synthetically generated data points, averaged over five reruns. As mentioned before, the Frobenius norm for CH-NMF has to be lower, as CH-NMF approximates the convex hull of the complete data distribution and ignores the intrinsic structure. Thus, for the purpose of interpretation and diversity it is probably not the right metric. (Best viewed in color.)

We conducted four different experiments. To compare running time and reconstruction performance in a controlled setup, we compared (H)CH-NMF on synthetically generated data. Our main focus, however, is on three additional experiments on massive real-world datasets, namely, the publication histories of 760,000 DBLP authors, 1.4 million activity profiles of guilds, and 4 million images from the Tiny Images dataset [Torralba et al., 2008]. For the sake of better visualization, we show small trees.

Synthetic Data: Along the lines of [Ding et al., 2009; Thurau et al., 2009], we evaluated the mean reconstruction error and run-time performance using a varying number of data points sampled from three randomly positioned Gaussians in 2D. As already shown in [Thurau et al., 2009], CH-NMF outperforms C-NMF for larger numbers of samples: it is several orders of magnitude faster while achieving competitive reconstruction errors. Therefore, we only compared HCH-NMF against CH-NMF. We varied the number of sampled data points from 100 to 5000 in steps of 100. The maximum number of iterations for any numerical optimization was 100. The number of basis vectors was set to 12, i.e., we searched for overcomplete representations. The results, averaged over 5 reruns, are summarized in Fig. 3. As one can see, HCH-NMF produces an overhead in running time for smaller sample sizes. For larger sample sizes, HCH-NMF catches up. The reconstruction error is slightly higher than for CH-NMF but still lower than for k-means (as reported in [Thurau et al., 2009]). The higher reconstruction error is due to the nature of convex hulls: any point within the convex hull can be perfectly reconstructed. Hence, CH-NMF provides a lower bound for HCH-NMF in terms of reconstruction error.
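The errors reported in Fig. 3 are average Frobenius reconstruction errors of the factorization V ≈ HW. A minimal numpy sketch follows; the matrix shapes and the normalization by the number of data points are our assumptions, not a statement about the exact convention used in the experiments.

```python
import numpy as np

def avg_reconstruction_error(V, W, H):
    """Average Frobenius reconstruction error of V ~ H W (sketch).

    V: (n, m) data matrix whose rows are data points,
    W: (k, m) basis vectors (actual data points on the convex hulls),
    H: (n, k) convex combination coefficients, i.e. rows of H are
       non-negative and sum to one.
    """
    return np.linalg.norm(V - H @ W, ord="fro") / V.shape[0]
```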
Figure 4: (Top) Clusters and corresponding basis vectors (histograms) found by HCH-NMF on the DBLP dataset. By describing the basis vectors, we gain an intuitively understandable description of academic careers: 97% of all authors are best described by the typical phases of an academic career. Indeed, there are renowned, extremely prolific exceptions. Because the basis vectors are actual data points, we can identify them as Papadimitriou and Reddy. (Bottom) Clusters and corresponding basis vectors found by HCH-NMF on the World of Warcraft® dataset. (Best viewed in color.)

Bibliographic Analysis based on DBLP: Bibliographic databases such as DBLP3 are a rich source of information. Here, we are interested in the question whether there are common patterns in the development of academic careers. To this end, we extracted from DBLP the cumulative publication histograms of 757,368 authors, cf. Fig. 4 (Top). A publication histogram consists of the number of publications listed in DBLP in an author's first year, second year, and so on. We cumulated the publication numbers over the years. The longest histogram we found spanned 68 years. To obtain equal-length curves, we filled missing years with 0. Following [Aitchison, 1982], we use logarithmic histogram values in our analysis. The idea is that the publication histogram of an author is a good descriptor for her activity and, to some extent, also for her success (although it does not necessarily imply impact or quality). A senior researcher, for example, is likely to have contributed over several years, but there are exceptionally prolific authors. PhD students, on the other hand, may not have published many papers. We expected HCH-NMF to discover these variations. This was indeed the case, as shown in Fig. 4 (Top). The patterns found can be summarized as "the majority of authors fall into one of the phases of a regular academic career (student, junior, senior), but of course there are illustrious exceptions". It took 1 hour to compute the model, i.e., growing the tree and pruning it. CH-NMF took 45 minutes, essentially yielding the union of all shown basis vectors and hence giving the impression that "there is a Papadimitriou in all of us".

3 http://www.informatik.uni-trier.de/~ley/db/

Social Network World of Warcraft®: This dataset consists of recordings of the online appearance of characters in the computer game World of Warcraft®. It is assumed that World of Warcraft® has about 12 million p(l)aying customers. The game takes place in a virtual medieval fantasy environment. World of Warcraft® is often considered one large social platform which is used for chatting, team-play, and gathering. Compared to well-known virtual worlds that mainly serve as chat platforms, such as Second Life®4, World of Warcraft® is probably the real second life, as it has a larger and more active (paying) user base. Moreover, a whole industry is developing around World of Warcraft®. It is estimated that 400,000 people worldwide are employed as gold farmers, i.e., collecting virtual goods for online games and selling them over the Internet. Players organize in groups, which are called guilds. Unlike groups known from other social platforms, such as Flickr, membership in a guild is exclusive. Obviously, the selection of a guild influences with whom players frequently interact. It also influences how successful players are in terms of game achievements. We assume that the level distribution within a guild is a good descriptor for its success and activity. For example, a guild of very experienced level 80 characters has a higher chance for achievements than a guild of level 10 players. Also, a level histogram gives an indicator of player activity over time. If players continuously stay with a particular guild, we expect an equally distributed level histogram, as the characters are continuously increasing their level over time.

4 http://secondlife.com/

The data was crawled from the publicly accessible site www.warcraftrealms.com. We viewed each online appearance of a character as a vote. Character observations span a period of 4 years. Every time a character is seen online, he votes for the guild he is a member of according to his level. We accumulate the votes into a level-guild histogram, going from level 10 (levels 1-9 are excluded) to level 80 (the highest possible level). Players advance in level by engaging in the game, i.e., completing quests or other heroic deeds. Following [Aitchison, 1982], we use logarithmic histogram values in our analysis.
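As an illustration, a guild's feature vector can be built as follows. This is a sketch: the function name and the use of log1p to handle empty bins are our choices; the cumulative DBLP publication curves are constructed analogously.

```python
import math

def guild_level_histogram(votes, min_level=10, max_level=80):
    """Level histogram of one guild (sketch).

    votes: iterable of character levels, one entry per observed online
    appearance; levels 1-9 are excluded.  Following [Aitchison, 1982], the
    counts are log-scaled; log1p is our choice to cope with empty bins.
    """
    counts = [0] * (max_level - min_level + 1)
    for level in votes:
        if min_level <= level <= max_level:
            counts[level - min_level] += 1
    return [math.log1p(c) for c in counts]
```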
In total, we collected 150 million votes of 18 million characters belonging to 1.4 million guilds. Running HCH-NMF took about 2.5 hours and revealed some very interesting patterns, as shown in Fig. 4 (Bottom). As for CH-NMF, see [Thurau et al., 2009], we can also spot singular events, in our case large updates to the game content (this is a regular procedure that makes novel content available and also allows further advancement in character level). Apparently, large updates to the game can result in a restructuring of social groups. More interestingly, HCH-NMF provides descriptions of the clusters. For instance, 90% of the data can be described in terms of just four basis vectors: "formed early then slowly disbanded till 1st update", "improving till 1st update", "very active with 1st update", and "Boosted activity with 1st update". That was surprising. We knew that most guilds are rather seldom active, see [Thurau et al., 2009]. Therefore, we "zoomed" into the model, a feature of HCH-NMF not supported by CH-NMF. Specifically, we had a look at the pivot and split elements of the first level of the induced tree, see Fig. 5 (Top). This revealed that most of the 90% are actually "seldom active": they lie between the "constantly active till 2nd update" and the "seldom active" guild on the FastMap line. Together, the four basis vectors of the 90% cluster are the archetypical guilds needed to reconstruct this pattern. Running CH-NMF took about 2 hours and produced essentially the same basis vectors. The major difference was that it used the "seldom active" guild directly and did not factorize it.

Figure 5: (Top) HCH-NMF tree as well as the pivot and split elements for the first split on the World of Warcraft® dataset. The "90%" cluster in Fig. 4 captures the data "between" the split element and the "right" pivot element. Thus, the majority of guilds is actually close to the "low constant activity" or "seldom active" guilds. Both patterns are well captured by combining the corresponding four basis vectors in Fig. 4. (Bottom) HCH-NMF's split elements for the 4 million tiny images are natural images. (Best viewed in color.)

Massive Image Collection: Our final experiment applies HCH-NMF to a subset of 4 million images of Torralba et al.'s 80 million tiny images [Torralba et al., 2008]. The images are represented as 384-dimensional GIST feature vectors. The result of running HCH-NMF is shown in Fig. 6. As already reported in [Thurau et al., 2009], some of the basis vectors discovered show a geometric similarity to Walsh filters that are found among the principal components of natural images [Heidemann, 2006]. This suggests that the extremal points in this large collection of natural images are located close to the principal axes of the data. On the other hand, the split elements shown in Fig. 5 (Bottom) are mostly "realistic" images. This suggests that they are located in the center of the data/clusters. In contrast to CH-NMF, HCH-NMF grouped the basis vectors into meaningful clusters. From left to right, the basis vectors, i.e., the columns, capture different aspects of the images such as dark-light-dark, horizontal-cross/circle-vertical, and complex-plain-complex.
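The "zooming" used above, i.e., inspecting pivot and split elements along a path of the induced tree, amounts to descending the learned splitting rules. A minimal sketch follows; the node attributes (x, y, theta, left, right, is_leaf) are our assumptions about what an inner node stores from its CHOOSERULE call.

```python
def descend(v, node):
    """Follow the stored splitting rules from the root to a leaf (sketch).

    Inner nodes are assumed to carry the FastMap pivots x, y and the split
    value theta of their CHOOSERULE call; leaves carry the basis vectors of
    their local convex hull.  v and the pivots are numpy arrays.
    """
    while not node.is_leaf:
        d = node.y - node.x
        fm_v = (v - node.x) @ d / (d @ d)   # 1D FastMap coordinate of v
        node = node.left if fm_v <= node.theta else node.right
    return node
```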
Overall, running HCH- NMF took only 23 hours and 37 minutes. This is remark- able as we used an external USB hard disk and the running time can be considerably reduced using an internal hard disk and implementing HCH-NMF fully in C/C++. 5 Conclusion We have introduced hierarchical convex-hull NMF. It seeks to leverage convex NMF by automatically adapting to the geometric structure of the data. It is fast and straightfor- ward to implement, provably solves the convex NMF prob- lem, combines the interpretability of both convex NMF as well as hierarchical decomposition methods, and scales well to massive, high-dimensional datasets. These contri- butions advance the understanding of descriptive analytics of massive, high-dimensional datasets and is an encourag- ing sign that applying NMF techniques in the wild, i.e., on hundreds of millions of data points may not be insurmount- able. There are several interesting avenues for future work. One is the application of HCH-NMF to other challenging datasets, such as Wikipedia, Netflix, Facebook, or the blog- sphere, and to use it for applications such as collaborative filtering. For the latter case, it is interesting to develop bot- tom up HCH-NMF variants and to investigate the missing values case. Another important avenue is parallelization. HCH-NMF suggests a natural data-driven parallelization: the training set is partitioned into subsets associated with separate processors. Finally, HCH-NMF is highly rele- vant for high-dimensional classification problems. Here, it is infeasible to include enough training samples to cover the class regions densely. As Cevikalp et al. [2008] have recently pointed out, irregularities in the resulting sparse sample distributions cause local classifiers such as near- est neighbors and kernel methods to have irregular decision boundaries. One solution is to ”fill in the holes” by build- ing a convex model of the regions spanned by the training samples. Using HCH-NMF, we even take the geometric structure of each class into account. Acknowledgment The authors would like to thank A. Torralba, R. Fergus, and W. T. Freeman for making the tiny images freely available and S. Zayakh for prepar- ing the DBLP dataset. M. Wahabzada and K. Kersting were supported by the Fraunhofer ATTRACT Fellowship STREAM. References [Aitchison, 1982] J. Aitchison. The Statistical Analysis of Compositional Data. J. of the Royal Statistical Society B, 44(2):139–177, 1982. [Boyd and Vandenberghe, 2004] S. Boyd and L. Vanden- berghe. Convex Optimization. Camb. Univ. Press, 2004. [Cai et al., 2008] Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han. Non-negative matrix factorization on man- ifold. In International Conference on Data Mining, pages 63–72. IEEE, 2008. [Cevikalp et al., 2008] H. Cevikalp, B. Triggs, and R. Po- likar. Nearest hyperdisk methods for high-dimensional classification. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), 2008. 103 Figure 6: HCH-NMF clusters (columns) and corresponding basis vectors (images in the columns) of the 4 million tiny images. From left to right, the basis vectors capture different aspects of the images such as dark-light-dark, horizontal- cross/circle-vertical, complex-plain-complex. (Best viewed in color.) [Cutler and Breiman, 1994] A. Cutler and L. Breiman. Archetypal Analysis. Technometrics, 36(4):338–347, 1994. [Dasgupta and Freund, 2009a] S. Dasgupta and Y. Freund. Random projection trees and low dimensional mani- folds. In R.E. Ladner and C. 
Dwork, editors, Proceed- ings of the 40th Annual ACM Symposium on Theory of Computing (STOC-08), pages 537–546, 2009. [Dasgupta and Freund, 2009b] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Transactions on Information Theory, 55:3229–3242, 2009. [de Berg et al., 2000] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Ge- ometry. Springer, 2000. [Ding et al., 2009] C.H.Q. Ding, T. Li, and M.I. Jordan. Convex and Semi-Nonnegative Matrix Factorizations. IEEE Trans. on Pattern Analysis and Machine Intelli- gence, 2009. Accepted for publication. [Donoho and Tanner, 2005] D.L. Donoho and J. Tanner. Neighborliness of Randomly-Projected Simplices in High Dimensions. Proc. of the Nat. Academy of Sci- ences, 102(27):9452–9457, 2005. [Faloutsos and Lin, 1995] C. Faloutsos and K.-I. Lin. FastMap: A Fast Algorithm for Indexing, Data- mining and Visualization of Traditional and Multimedia Datasets. In Proc. ACM SIGMOD, 1995. [Golub and van Loan, 1996] G.H. Golub and J.F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996. [Halevy et al., 2009] A.Y. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24:8–12, 2009. [Hall et al., 2005] P. Hall, J. Marron, and A. Neeman. Ge- ometric Representation of High Dimension Low Sam- ple Size Data. J. of the Royal Statistical Society B, 67(3):427–444, 2005. [Hastie et al., 2001] T. Hastie, R. Tibshirani, and J. Fried- man. The Elements of Statistical Learning: Data Min- ing, Inference, and Prediction. Springer, 2001. [Heidemann, 2006] G. Heidemann. The principal compo- nents of natural images revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 2006. [Hueter, 1999] I. Hueter. Limit Theorems for the Convex Hull of Random Points in Higher Dimensions. Trans. of the American Mathematical Society, 351(11):4337– 4363, 1999. [Jolliffe, 1986] I.T. Jolliffe. Principal Component Analy- sis. Springer, 1986. [Kim and Park, 2008] Jingu Kim and Haesun Park. To- ward faster nonnegative matrix factorization: A new al- gorithm and comparisons. In International Conference on Data Mining, pages 353–362. IEEE, 2008. [Kolda and Sun, 2008] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In International Conference on Data Mining, pages 363–372. IEEE, 2008. [Lee and Seung, 1999] D. D. Lee and H. S. Seung. Learn- ing the parts of objects by non-negative matrix factor- ization. Nature, 401(6755):788–799, 1999. [Mairal et al., 2010] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19–60, 2010. [Ostrouchov and Samatova, 2005] G. Ostrouchov and N.F. Samatova. On FastMap and the convex hull of multivariate data: toward fast and robust dimension reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1340–1434, 2005. [Suvrit, 2008] Sra Suvrit. Block-iterative algorithms for non-negative matrix approximation. In ICDM, pages 1037–1042. IEEE, 2008. [Talwalkar et al., 2008] Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In Computer Vision and Pattern Recognition. IEEE, 2008. [Thurau et al., 2009] C. Thurau, K. Kersting, and C. Bauckhage. Convex non-negative matrix factoriza- tion in the wild. In H. Kargupta and W. Wang, editors, Proceedings of the 9th IEEE International Conference on Data Mining (ICDM-09), 2009. 
[Torralba et al., 2008] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recog- nition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008. [Ziegler, 1995] G.M. Ziegler. Lectures on Polytopes. Springer, 1995. 104 Quantitatives Frequent-Pattern Mining u¨ber Datenstro¨men Daniel Klan, Thomas Rohe Department of Computer Science & Automation TU Ilmenau, Germany {first.last}@tu-ilmenau.de Abstract Das Aufdecken unbekannter Zusammenha¨nge za¨hlt zu einer der wichtigsten Aufgaben im Da- ta Mining. Fu¨r das Problem des Frequent Pat- tern Mining u¨ber statischen Daten finden sich da- her in der Literatur eine Vielzahl an Lo¨sungen. Die Integration von Sensorik in nahezu jeden Lebensbereich fu¨hrt allerdings zu Datenmen- gen, welche mittels der klassischen Verfahren zumeist nicht mehr bewa¨ltigt werden ko¨nnen. Ein Paradigmenwechsel hin zur Datenstrom- Verarbeitung ist oftmals unumga¨nglich. Ein in- teressantes Problem, welches im Zusammenhang mit der Verarbeitung von Sensordaten auftritt ist der prinzipiell stetige Wertebereich von Messun- gen. Die bekannten Lo¨sungen sind fu¨r die Analy- se von kontinuierlichen Daten u¨ber stetigen Wer- tebereichen nur bedingt geeignet. Im folgenden soll mit dem FP 2-Stream ein entsprechendes Verfahren fu¨r die Analyse quantitativer ha¨ufiger Muster u¨ber Datenstro¨men pra¨sentiert werden. 1 Einleitung Das Finden ha¨ufiger Muster (frequent pattern mining) ist Grundlage fu¨r eine Vielzahl von Data Mining Problemen (zum Beispiel der Korrelationsanalyse, oder dem Finden von Sequenzen oder Perioden innerhalb des Datenstromes). Die popula¨rste Anwendung fu¨r das Frequent Pattern Mi- ning ist das Assoziation Rule Mining, bei welcher aus Transaktionen Werte extrahiert und in Relation zueinander gesetzt werden. Eine typische Anwendung ist die Analy- se von Supermarkt-Transaktionen, bei welcher das Kauf- verhalten der Kunden analysiert wird. Ziel ist dabei die Detektion von Regeln, welche das Kaufverhalten einer re- pra¨sentativen Menge von Kunden wiederspiegeln und an- hand derer anschließend die Verkaufsprozesse optimiert werden ko¨nnen. Eine mo¨gliche Regel ist “Bier → Chips (10%)”, welche besagt, dass 10 Prozent aller Kunden, die Bier gekauft haben auch Chips kaufen. Wie die meisten Data Mining Verfahren, so wurde auch das Frequent Pattern Mining urspru¨nglich fu¨r statische Da- tenbesta¨nde entwickelt. In den letzten Jahren wurden die- se Verfahren an das Datenstrom-Paradigma angepasst [5; 8]. In Interessantes Beispiel fu¨r die Anwendung der Asso- ziation Rule Mining u¨ber Datenstro¨men ist in [4] beschrie- ben. Die Autoren ersetzen dabei fehlende Werte in einem WSN anhand zuvor abgeleiteter Assoziationsregeln. In einer Vielzahl von Anwendungen liegen die Daten zur Analyse nicht in Form kategorischer, sondern als quanti- tative Attribute vor. Die Anwendung der bekannten Fre- quent Pattern Mining Verfahren fu¨hrt hierbei oftmals nicht zu den gewu¨nschten Zielen bzw. der Berechnungsaufwand ist nicht vertretbar. Basierend auf dieser Feststellung entwi- ckelten die Autoren in [11] erstmals ein Verfahren fu¨r das Finden von Regeln, welche sowohl mit kategorischen Attri- buten als auch mit quantitativen Attributen umgehen kann. Das vorgestellte Verfahren bildet dabei quantitative auf ka- tegorische Attribute ab. Die quantitativen Attribute wer- den im wesentlichen durch Intervalle dargestellt, welche eine Menge von numerischen Werten aufnehmen ko¨nnen. 
Auf Basis dieser Definition pra¨sentierten die Autoren einen angepassten Apriori-Algorithmus, welcher den Suchraum zerlegt (diskretisiert) und anschließend durch Kombination der Teil-Intervalle Regeln findet. Es existieren eine Vielzahl von Anwendungsszenarien in denen die Extraktion quantitativer Regeln bzw. das Finden quantitativer Frequent Itemsets u¨ber Datenstro¨men von In- teresse sind (z.B. Geba¨udeu¨berwachung). Die in der Lite- ratur pra¨sentierten Verfahren beziehen allerdings sich aus- schließlich auf das Finden von Regeln u¨ber statischen Da- tenbesta¨nden. Fu¨r ein Verarbeitung von Datenstro¨men sind diese ungeeignet. Zudem weisen die meisten Verfahren Be- schra¨nkungen auf, derart dass quantitative Attribute ledig- lich im Regel-Kopf auftreten du¨rfen. Die Arbeit ist im weiteren wie folgt aufgeteilt. Zuna¨chst folgt in Abschnitt 2 ein kurzer U¨berblick u¨ber existieren- de Arbeiten zum Thema. Anschließend wird in Abschnitt 3 auf fu¨r das Versta¨ndnis notwendige Grundlagen eingegan- gen. In Abschnitt 4 folgt eine ausfu¨hrliche Beschreibung des entwickelten Verfahrens, welche in Abschnitt 5 evalu- iert werden soll. 2 Verwandte Arbeiten Eines der ersten Verfahren zum Finden von ha¨ufigen Mus- tern war das von Aggrawal et. al pra¨sentierte Apriori- Verfahren [1]. Die Grundlage des Ansatzes bildet die Annahme, dass sich ha¨ufige Muster ausschließlich aus ha¨ufigen Teilmustern zusammensetzen ko¨nnen. Basierend auf dieser Heuristik erzeugt das Verfahren alle mo¨glichen k− itemset Kandidaten aus der Menge aller k− 1-itemset Kandidaten. Aufgrund der wiederholten Analyse der Daten handelt es sich hierbei um ein multi-pass Verfahren, wel- ches seine Anwendung insbesondere in der Verarbeitung von endlichen Datenmengen findet. Auf Basis des Apriori-Algorithmus wurden eine Viel- zahl an weiteren Algorithmen entwickelt, welche unter an- derem die Performance-Steigerung oder die Adaption an Datenstro¨me zum Ziel hatten. So stellen die Autoren in [3] ein effizientes Verfahren zum Finden von optimalen links- 105 seitigen quantitativen Regeln vor. Der Schwerpunkt ist da- bei die Entwicklung eines Verfahrens zur Berechnung der optimalen Bereiche in linearer Zeit unter der Annahme das die Daten sortiert sind. Fu¨r denn Fall, dass eine Sortierung der Daten nicht mo¨glich ist, wurde zusa¨tzlich ein randomi- siertes Verfahren pra¨sentiert, welches die Daten in Buckets teilt und anschließend auf diesen die Suche durchfu¨hrt. Die Autoren in [9] pra¨sentierten mit dem DHP (Direct Hashing and Pruning) ein Verfahren, dessen Ziel die Optimierung der Anzahl an Kandidaten ist. Das Verfahren setzt dabei auf eine Hash-Tabelle fu¨r die Identifikation von Kandida- ten mit einem Support gro¨ßer dem geforderten. Han et. al pra¨sentieren in [5] mit dem FP-Tree (frequent pattern tree) einen Prefix-Baum zur effizienten Speiche- rung ha¨ufiger Muster. Zusa¨tzlich zu dieser Datenstruktur stellen sie mit dem FP-Growth Algorithmus ein Verfahren zum Finden von ha¨ufigen Mustern auf Basis des FP-Tree vor. Im Gegensatz zum Apriori Verfahren verzichtet das FP-Growth Verfahren vollsta¨ndig auf die Generierung von Kandidaten. Auch sind lediglich zwei Durchla¨ufe u¨ber die zu pru¨fende Datenmenge ausreichend. FP-Mining u¨ber Datenstro¨men stellt eine besondere Her- ausforderung dar. Neben dem in [5] beschriebenen FP 2- Stream existieren auch eine Vielzahl weiterer Lo¨sungen. Der in [12] vorgestellte Compact Pattern Tree (CPT) zum Beispiel verwendet gleitende Fenster (analog dem FP 2- Stream und dem DSTree [7]). 
Die Effizienz des Verfahrens wird dabei durch eine zusa¨tzliche Restrukturierungs Phase erreicht, wa¨hrend der der CPT in Abha¨ngigkeit der Itemset- Ha¨ufigkeiten mit dem Ziel einer hohen Kompaktheit um- sortiert wird. In [8] beschreiben die Autoren neben dem Sticky-Sampling und Lossy-Counting mit BTS(Buffer- Trie-SetGen) ein Verfahren, welches die Ha¨ufigkeiten von Itemsets mittels einer Gitter-Strukur speichert. Die erste Arbeit die sich mit dem Problem der quantita- tiven Muster Erkennung bescha¨ftigt stammt von Aggrawal [11]. Das vorgestellte Verfahren basierte dabei im wesent- lichen auf einem Apriori-Verfahren, welches die partitio- nierten Wertebereiche der einzelnen Attribute in geeigneter Weise kombiniert. Die Autoren in [2] betrachten quantitativ/kategorische Assoziationsregeln. Eine Regel wird signifikant angese- hen, wenn sich deren Mittelwert in Relation zur Menge aller Transaktionen ohne dieses Regel als signifikant her- ausstellt. Zur Berechnung des Signifikanzlevels ziehen sie dabei den Steiger Z-Test heran. Als Null-Hypothese neh- men sie an, das die Mittelwerte beider Teilmengen gleich sind. Wird diese Null-Hypothes abgelehnt (mit einer Konfi- denz von 95%), so wird die gefundene Regel als signifikant unterschiedlich von der Restmenge der Transaktionen an- genommen. 3 Vorbetrachtung Zuna¨chst mu¨ssen verschiedene Begrifflichkeiten eingefu¨hrt werden, welche fu¨r das Versta¨ndnis des im weiteren pra¨sentierte Verfahren notwendige sind. Im folgenden bezeichnet ai ein Attribut. ai〈v〉 ent- spricht dem zum Attribut ai geho¨render Wert v. Das Tripel (ai, l, r) = ai(l, r) bezeichnet das quantitative Attribut ai, welches Werte v1, v2, ... im Intervall (l, r) aufnimmt (l be- zeichnte die linke und r die rechte Intervallgrenze). Ein ka- tegorisches Attribut, d.h. ein einelementiges Attribut, kann durch ein quantitatives Attribut repra¨sentiert werden, des- sen linke und rechte Intervallgrenze gleich sind, d.h. ai(l, l) bezeichnet das kategorische Attribut, welches nur den Wert a1 a2 a1 a2 t1 22 100 t6 21.5 0 t2 22.5 - t7 21 - t3 - 100 t8 - 100 t4 22.5 - t9 22 - t5 21.5 0 t10 22.5 100 Abbildung 1: Beispiel Transaktionen itemset freq supp {〈a1(20, 21.5]〉} 3 0.3 {〈a1(20, 23]〉} 8 0.8 {〈a2(0, 50]〉} 4 0.4 {〈a2(50, 100]〉} 4 0.4 {〈a1(20, 21.5]〉, 〈a2(0, 50]〉} 2 0.2 {〈a1(20, 23]〉, 〈a2(0, 50]〉} 2 0.2 {〈a1(20, 23]〉, 〈a2(50, 100]〉} 2 0.2 Abbildung 2: Beispiel Itemsets l repra¨sentiert. Das Attribut ai(l, r) wird im folgenden als Item bezeichnet. I bezeichnet die Menge aller Items. Die Menge X = {a1(l, r), ..., ak(l, r)} ⊆ I, with ai 6= aj bezeichnet ein Itemset (oder auch k-Itemset). D bezeichnet eine Menge von Transaktionen. Eine Transaktion d∆t ∈ D ist eine Menge von Attributwer- ten ai〈v〉 im Zeitintervall ∆t. Jede Transaktion d∆t = {ai〈v〉, ..., aj〈v〉} kann auf ein Itemset I abgebildet wer- den, d.h. fu¨r jeden Wert ai〈v〉 existiert ein Item ai(l, r) in I mit ai〈v〉 ∈ ai(l, r). Die Ha¨ufigkeit freq(X,D) eines Itemsets X bezeich- net die Anzahl der Transaktionen D im Zeitintervall ∆t, die auf das Itemset abgebildet werden ko¨nnen. Der supp(X,D) eines Itemsets ist definiert als der prozentuale Anteil an Transaktionen, welcher auf das Itemset entfa¨llt. U¨blicherweise sind nur ha¨ufige Itemsets von Interesse. Ein Itemset wird als ha¨ufig bezeichnet, wenn dessen Ha¨ufigkeit einen vordefinierten Schwellwert minsupp u¨berschreitet. 
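Zur Veranschaulichung lassen sich Häufigkeit und Support eines quantitativen Itemsets direkt aus diesen Definitionen berechnen. Die folgende Python-Skizze verwendet dafür frei gewählte Namen und Datenstrukturen (Transaktionen als Dictionaries, Items als Intervall-Tupel); sie ist keine Implementierung des später vorgestellten Verfahrens.

```python
def passt(transaktion, itemset):
    """Prüft, ob eine Transaktion auf ein quantitatives Itemset abgebildet
    werden kann, d.h. ob jeder geforderte Attributwert im linksoffenen
    Intervall (l, r] des zugehörigen Items liegt."""
    return all(a in transaktion and l < transaktion[a] <= r
               for a, (l, r) in itemset.items())

def freq_supp(itemset, transaktionen):
    """Häufigkeit freq(X, D) und Support supp(X, D) eines Itemsets X über D."""
    f = sum(1 for t in transaktionen if passt(t, itemset))
    return f, f / len(transaktionen)
```

Mit den Transaktionen aus Abbildung 1 liefert freq_supp für {⟨a1(20, 21.5]⟩} beispielsweise die Werte (3, 0.3) aus Abbildung 2.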
Signifikanz und Informationsdichte Ziel des quantita- tiven Frequent Itemset Mining ist das Finden von zusam- menha¨ngenden Items, deren Informationsgehalt sich si- gnifikant von dem aller anderen Items unterscheidet. Ein quantitatives Item ist genau dann signifikant, wenn dessen Informationsdichte gro¨ßer ist, als die Informationsdichte der alternativen Items. Die Informationsdichte eines Items ai(l, r) ist dabei wie folgt definiert density(ai(l, r)) = freq(ai(l, r)) dist(l, r) (1) wobei dist(l, r) = abs(l−r) die Distanz zwischen den bei- den Intervallgrenzen des Items bezeichnet. Entsprechend dieser Definition wird ein Itemset I genau dann als signifi- kant bezeichnet, wenn alle Items dieses Itemsets signifikant sind. Generalisierung Ein Itemset Xˆ wird genau dann als Ge- neralisierung von X bezeichnet, wenn Xˆ aus den selben Items wie X besteht und es gilt ∀ai(l, r) ∈ X : ai(l, r) ∈ X∧ai(l′, r′) ∈ Xˆ ⇒ l′ ≤ l ≤ r ≤ r′ D.h. alle Transaktion, welche sich auf X abbilden lassen, ko¨nnen ebenso auf Xˆ abgebildet werden. Im folgenden soll ein kurzes Beispiel die eben beschriebenen Zusam- menha¨nge verdeutlichen. 106 Beispiel 1 Es wird ein Datenstrom angenommen be- stehend aus zwei Attributen a1 (Temperatur) und a2 (Be- wegung) angenommen. Im Zeitintervall ∆t wurden die in Tabelle 3 dargestellten 10 Transaktionen festgestellt. Wei- terhin werden die folgenden Items angenommen: a1(20, 21.5], a1(20, 23], a2(0, 50] und a2(50, 100] Basierend auf diesen (quantitativen) Items lassen sich u¨ber den in Tabelle 3 abgebildeten Transaktionen die in Tabel- le 3 dargestellten (quantitativen) Itemsets ermitteln. Die Tabelle zeigt die entsprechenden Ha¨ufigkeiten und den Support, den die einzelnen Itemsets aufweisen. Es las- sen sich die folgenden Dichten fu¨r die Items ermitteln: density(a1(20, 21.5])) = 2, density(a1(21.5, 23]) = 3.3 und density(a1(20, 23]) = 2.7. Bei dem einelementigen Itemset {〈a1(20, 23]〉} handelt es sich um eine Generalisie- rung der Itemsets {〈a1(20, 21.5]〉} und {〈a1(21.5, 23]〉}. 4 FP 2-Stream Die meisten in der Literatur zu findenden Verfahren [11; 2], welche quantitative Attribute betrachten, zerlegen den Wertebereich fu¨r ein Attribut in a¨quidistante Intervalle, welche anschließend derart miteinander kombiniert wer- den, dass sie den geforderten Kriterien (minimaler Support und Signifikanz) genu¨gen. Zu den wesentlichen Problemen dieser bottom-up Strategie za¨hlen dabei das Finden einer geeignete Zerlegung [11] bzw. das mo¨glichst effiziente ge- nerieren zusammenha¨ngender Items. Die Zerlegung gefolgt von einer kostenintensiven Item- Rekonstruktion ist auf Datenstro¨me nicht bzw. nur be- schra¨nkt anwendbar. Alle Verfahren, welche das Prin- zip der Kombination von Teilintervallen einsetzen, basie- ren dabei auf dem Apriori-Algorithmus, welcher mehrere Durchla¨ufe beno¨tigt und daher fu¨r die Analyse u¨ber Daten- stro¨men nur bedingt geeignet ist. Im folgenden soll der FP 2-Stream vorgestellt werden, ein Speicher-effizientes Verfahren, welches fu¨r die Analyse u¨ber Datenstro¨men geeignet ist. Die Itemsets werden beim FP 2-Stream in einem Prefix-Baum (a¨hnlich dem FP-Tree von Han et al. [5]) verwaltet. Das Ziel des vorgestellten Verfahrens ist anschließend die kontinuierliche Verfeine- rung der Items (top-down-Strategie) und den daraus aufge- bauten Itemsets mit dem Eintreffen neuer Transaktionen, so dass diese den geforderten Kriterien genu¨gen. 
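Das dabei zugrunde liegende Dichtekriterium aus Gleichung (1) und der Generalisierungsbegriff aus Abschnitt 3 lassen sich kompakt wie folgt skizzieren (Funktionsnamen frei gewählt):

```python
def dichte(freq, l, r):
    """Informationsdichte eines Items a_i(l, r] nach Gleichung (1):
    density = freq / dist(l, r) mit dist(l, r) = |l - r|."""
    return freq / abs(l - r)

def ist_generalisierung(l2, r2, l1, r1):
    """a_i(l2, r2] ist genau dann eine Generalisierung von a_i(l1, r1],
    wenn l2 <= l1 <= r1 <= r2 gilt."""
    return l2 <= l1 <= r1 <= r2
```

Mit den Werten aus Beispiel 1 ergibt sich etwa dichte(3, 20, 21.5) = 2, wie im Text angegeben.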
Im weiteren soll zuna¨chst die Grundlegende Datenstruk- tur beschrieben werden, anschließend folgt eine Beschrei- bung der Algorithmen zum Einfu¨gen von Transaktionen und zum Optimieren der Datenstruktur. 4.1 Datenstruktur Der FP 2-Stream verwaltet die ha¨ufigsten Muster in ei- nem Prefix-Baum. Eine Header-Tabelle entha¨lt alle Item- sets, welche sich gegenwa¨rtig im Prefix-Baum befinden. Der Pfad von der Wurzel bis zu einem Knoten im Prefix- Baum repra¨sentiert ein Itemset. Zusa¨tzlich ist in jedem Knoten des Baumes ein Zeitfenster eingebettet, welches die Ha¨ufigkeiten in den letzten k Zeitra¨umen aufnimmt. Im FP 2-Stream werden hierzu gleitende Fenster eingesetzt. Weiterhin sind alle Knoten, welche das gleiche Item re- pra¨sentieren untereinander u¨ber Listen miteinander verbun- den. Das entsprechende Item der Header-Tabelle verweist dabei auf das erste Element dieser Liste. Jedes Item der Header-Tabelle entha¨lt zusa¨tzlich ein Equi-Width Histo- gramm, welches einen approximativen U¨berblick u¨ber die Ha¨ufigkeitsverteilung der zuletzt eingefu¨gten Werte in die Input: Menge von Transaktionen D Output: Menge von Ha¨ufigen Itemsets Initialisiere FP 2-Tree als leer;1 Stelle min und max fu¨r jedes Attribut ai fest und lege2 die entsprechenden knoten im FP 2-Tree an; Fu¨ge die Transaktionen von Batch b0 in FP 2-Tree ein;3 U¨bernehme Knotengrenzen und Ha¨ufigkeiten aus dem4 FP 2-Tree in den FP 2-Stream; while Batch bi, i > 0 do5 Initialisiere FP 2-Tree mit den Knoten und6 Intervallgrenzen des FP 2-Stream; Sortiere alle Transaktionen aus aktuellem Batch in7 FP 2-Tree ein (falls notwendig, fu¨ge neue Knoten hinzu bzw. erweitere existierende Knoten); U¨bertrage alle Knoten aus dem FP 2-Tree in den8 FP 2-Stream; Fu¨hre eventuell notwendige Split Operationen auf9 dem FP 2-Stream aus; Fu¨hre eventuell notwendige Merge Operationen10 auf dem FP 2-Stream durch; Lese alle ha¨ufigen Itemsets aus dem FP 2-Stream11 aus (FP 2-Growth); Algorithm 1: FP 2-Stream Algorithmus Items gibt. Abbildung 3 zeigt einen Beispiel-FP 2-Stream. Dem Beispiel liegen die in Tabelle 3 beschriebenen Trans- aktionen zugrunde. Das Attribut a2 wurde bereits einer Ver- feinerung unterzogen. 4.2 Einfu¨gen neuer Transaktionen Das Einfu¨gen neuer Transaktionen in den FP 2-Stream er- folgt Batchweise (ein Batch b bezeichnte die Zusammen- fassung einer Menge von |b| Transaktionen). Neue Trans- aktionen werden beim Einfu¨gen nicht direkt in den FP 2- Stream integriert, sondern zuvor in einen FP 2-Tree ein- gefu¨gt. Der FP 2-Tree entspricht dabei im wesentlichen dem FP 2-Stream, wobei die Knoten jedoch nicht u¨ber Zeitfenster verfu¨gen (der FP 2-Tree verwaltet lediglich die zu einem Batch geho¨renden Musterha¨ufigkeiten). Die Im- plementierung des FP 2-Tree erfolgt in Form von ”Schat- tenknoten“, welche in die Knoten des FP 2-Stream inte- griert werden. Somit fallen fu¨r das Anlegen des FP 2-Tree keine zusa¨tzliche Kosten an. Es muss lediglich beim ers- ten Einfu¨gen einer Transaktion in einen neuen Batch der entsprechende ”Schattenknoten“ angelegt werden. Analog dem FP-Stream wird das Einfu¨gen des ersten Batches b0 getrennt von der Behandlung aller weiteren Bat- ches betrachtet. Zum Zeitpunkt des Einfu¨gens von b0 in den FP 2-Stream ist kein Wissen u¨ber die genaue Verteilung der Stromdaten vorhanden bzw. es stehen lediglich die In- formationen aus dem ersten Batch zur Verfu¨gung. 
Zuna¨chst wird daher fu¨r jedes Attribute i ein Item ai(min,max) derart angelegt, dass min dem minimalen Wert in b0 und max dem maximalen Wert in b0 des Attributes ai im ers- ten Batch entspricht. Die Items werden anschließend nach der Ha¨ufigkeit ihres Auftretens im ersten Batch sortiert und in den FP 2-Tree eingefu¨gt. Ha¨ufige Items werden Wurzel- nah eingefu¨gt. Da im ersten Batch ein Attribut den kom- pletten Wertebereich eines Datenstromes u¨berdeckt, kann es ausschließlich durch das vollsta¨ndige Fehlen von Wer- ten innerhalb von Transaktionen zu Unterschieden in den Ha¨ufigkeiten der einzelnen Attribute kommen. Nachdem der erste Batch erfolgreich eingefu¨gt wurde, 107 {} Header Table 1 1 1 1 2 2 next item item list next intervall neighbor list 1 1 a1(20, 23) a2(50, 100) a2(0, 50] a2(50, 100):2 b0 b1 b0 b1 a2(50, 100):2a1(20, 23):4 b0 b1 a2(0, 50]:2 b0 b1 Abbildung 3: Beispiel FP 2-Stream werden alle weiteren Batches gleich behandelt und folgt nachstehendem Schema: • Es existiert bereits ein Knoten, welcher das Itemset re- pra¨sentiert. Die Frequenz des entsprechenden Knotens im FP 2-Tree wird um 1 inkrementiert. • Es existiert kein Knoten, welcher das Itemset re- pra¨sentiert. Es muss ein neuer Knoten im FP 2-Tree angelegt werden: – Der neu anzulegende Knoten wird sowohl auf der linken, als auch auf der rechten Seite von exis- tierenden Knoten eingeschlossen: Als Intervall- grenzen fu¨r den neuen Knoten werden die Gren- zen der benachbarten Knoten gewa¨hlt. D.h. es wird ein Knoten ai(r, l′) zwischen den beiden begrenzenden Knoten ai(l, r) und ai(l′, r′) an- gelegt. – Der einzufu¨gende Wert u¨ber- bzw. unterschreitet alle bisher eingefu¨gten Werte: Existiert ein Kno- ten im FP 2-Tree, welches noch keinem Knoten im FP 2-Stream entspricht und dessen Intervall- grenzen sich derart erweitern lassen, dass es den Wert aufnehmen kann, dann werden die Grenzen dieses Knotens angepasst und die Frequenz ent- sprechend inkrementiert. Andernfalls muss ein neuer Knoten mit dem neuen Wert als Minimum bzw. Maximum als Grenze angelegt werden. Wurden ausreichend Transaktionen in den FP 2-Tree ein- gefu¨gt, dann kann dieser in den FP 2-Stream integriert werden. Zuna¨chst werden hierzu alle Knoten des FP 2- Tree entfernt, welche dem vorab definierten Schwellwert  nicht genu¨gen. Anschließend werden die Ha¨ufigkeiten der einzelnen Knoten in den FP 2-Stream u¨bernommen. Hierzu werden in die Zeitfenster aller Knoten des FP 2- Stream neue Zeitslots eingefu¨hrt. Sollten im FP 2-Stream Knoten vorhanden sein, welche durch die Integration des FP 2-Tree keine Aktualisierung erfahren, so sind deren Ha¨ufigkeiten fu¨r diesen Zeitslot 0. Sind wa¨hrend der Ver- arbeitung eines Batches neue Knoten im FP 2-Tree hinzu- gekommen, so mu¨ssen diese ebenfalls in den FP 2-Stream u¨bernommen werden. Wurden die Transaktionen eines Batches erfolgreich in den FP 2-Stream eingefu¨gt, dann wird anschließend gepru¨ft, inwieweit sich Items verfeinern lassen um eine mo¨glichst hohe Informationsdichte zu erhalten. Die Verfei- nerung ist hierbei ein zweistufiger Prozess. In einem ersten Schritt werden die Items verfeinert, wenn deren Fu¨llgrad zu hoch ist. Der zweite Schritt ist das Mischen von Items, wenn sich diese generalisieren lassen. Beide Prozesse sol- len im weiteren beschrieben werden. 4.3 Item-Split Als wesentliches Kriterium fu¨r schlecht approximierte Items wird die Dichteverteilung innerhalb eines Items her- angezogen. 
Sind die in ein Item eingefu¨gten Werte un- gleichma¨ßig verteilt, dann wurden die Grenzen fu¨r dieses Item schlecht gewa¨hlt. Die Bestimmung der Ungleichver- teilung erfolgt dabei u¨ber die Schiefe den in der Header Ta- belle mitgefu¨hrten Equi-Width-Histogrammen. Die Schiefe eines Item ai(l, r) u¨ber einem Histogramm la¨sst sich wie folgt bestimmen: s(ai(l, r))n = max{h(ai(l, r), j)}nj=1 −min{h(ai(l, r), j)}nj=1∑n j=1 h(ai(l, r), j) wobei n die Anzahl der Buckets in den Histogrammen be- zeichnet. h(ai(l, r), j) bezeichnet die Ha¨ufigkeit im j-ten Bucket des Histogramms fu¨r das Item ai(l, r). U¨berschreitet die Schiefe innerhalb des Items ai(l, r) einen vorab definierten Schwellwert maxskew, d.h. s(ai(l, r)) > maxskew, dann wird das Item in zwei disjunkte Items gesplittet. O.B.d.A. werden im folgenden Items immer in Items mit links offenem Intervall zerlegt. Fu¨r den Split wird ein Median-basierter Ansatz verwen- dete, dessen Ziel die Beru¨cksichtigung der realen Dich- teverha¨ltnisse innerhalb eines Items ist. Hierzu wird der Median u¨ber die einzelnen Buckets der Histogramme be- stimmt. Der Split erfolgt anschließend auf Basis des Mittel- punktes des Median-Buckets, d.h. ai(l, r) wird in die bei- den Items ai(l,median/2] und ai(median/2, r) zerlegt, wobei median den Median bezeichnet. Bei einem Itemsplit mu¨ssen alle Knoten im Baum, wel- che dieses Item repra¨sentieren ebenfalls gesplittet wer- den. Sind unterhalb eines Knotens, welcher ein Item re- pra¨sentiert das gesplitted wird, weitere Knoten, so werden diese Teilba¨ume kopiert und als neue Teilba¨ume in die neu entstehenden Knoten eingebunden. Die Ha¨ufigkeit des zu splittenden Items wird beim U¨berfu¨hren in die neuen Items halbiert (die genaue Verteilung der Daten in den einzelnen Knoten ist unbekannt, weswegen der Einfachheit halber ein Gleichverteilung angenommen wird). Ein Beispiel soll den Itemsplit verdeutlichen. Beispiel 2 Die Ha¨ufigkeit des Items 〈a1(20, 23]〉 aus dem vorigen Beispiel genu¨gt den definierten Be- dingungen. Beim Itemsplit auf Basis der Intervall- Halbierung wird das Item in die beiden Subitems 〈a1(20, 21.5]〉 und 〈a1(21.5, 23]〉 geteilt. Es entstehen somit die beiden Itemsets {〈a1(20, 21.5]〉, 〈a2(50, 100]〉} und {〈a1(21.5, 23]〉, 〈a2(20, 100]〉} welche jeweils eine gescha¨tzte Ha¨ufigkeit von 2 besitzen. Abbildung 4 zeigt den entsprechenden FP 2-Stream aus Abbildung 3 nach dem Split des Items 〈a1(20, 23]〉. 108 11 Header Table 0.5 0.5 0.5 0.5 1 1 0.5 0.5 1 1 {} next item item list next intervall neighbor list 0.5 0.5 a1(20, 21.5]:2 b0 b1 a1(21.5, 23) a1(20, 21.5] a2(50, 100) a2(0, 50] a2(50, 100):1 b0 b1 a2(0, 50]:1 b0 b1 a1(21.5, 23):2 b0 b1 a2(50, 100):1 b0 b1 b0 b1 a2(50, 100):2 a2(0, 50]:1 b0 b1 Abbildung 4: Split node a1(20, 23] Eine beliebige Verfeinerung der Intervalle ist aus Gru¨nden einer effizienten Datenhaltung nicht sinnvoll. So- mit ergibt sich die Fragestellung inwieweit Intervalle ver- feinert werden. In [11] pra¨sentieren die Autoren eine Un- tersuchung zur Anzahl an maximal notwendige Partitio- nen (Basisintervallen), welche fu¨r die anschließende Re- kombination zu erzeugen sind und dabei einen mo¨glichst geringen Informationsverlust aufweisen. Als Maß fu¨r den durch die Generalisierung entstehenden Informationsver- lust fu¨hren sie die Partial-Completeness ein. Als K-Partial- Completeness wird dabei der maximale Support bezeichnet (als K-vielfaches des minimalen Support), den ein Itemset aufweisen darf. 
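Bevor auf die maximal sinnvolle Verfeinerung eingegangen wird, fasst die folgende Skizze das beschriebene Schiefe-Kriterium und den medianbasierten Split über dem Equi-Width-Histogramm zusammen; die konkrete Wahl des Splitpunkts als Mittelpunkt des Median-Buckets ist dabei unsere Lesart des Textes.

```python
def schiefe(hist):
    """Schiefe s(a_i(l, r]) über einem Equi-Width-Histogramm:
    (max_j h_j - min_j h_j) / sum_j h_j."""
    return (max(hist) - min(hist)) / sum(hist)

def split_punkt(hist, l, r):
    """Medianbasierter Splitpunkt (Skizze): Mittelpunkt des Buckets,
    das den Median der eingefügten Werte enthält."""
    breite = (r - l) / len(hist)
    haelfte = sum(hist) / 2.0
    summe = 0
    for j, h in enumerate(hist):
        summe += h
        if summe >= haelfte:          # Median-Bucket gefunden
            return l + (j + 0.5) * breite
    return (l + r) / 2.0
```

Überschreitet schiefe(hist) den Schwellwert maxskew, wird a_i(l, r] entsprechend in a_i(l, θ] und a_i(θ, r) zerlegt.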
Basierend auf der Partial-Completeness ha- ben die Autoren gezeigt, dass im Fall einer Partitionierung in Basisintervalle gleicher Gro¨ße, maximal 2·αminsupp·(K−1) . Intervalle notwendig sind. α bezeichnet dabei die Anzahl an quantitativen Attributen. Daraus la¨sst sich die mini- male Intervallgro¨ße fu¨r jedes Datenstromattribut ai ab- leiten. Zum Zeitpunkt des ersten Erstellens des FP 2- Stream (Batch b0) wird fu¨r jedes Attribut ai(min,max) bestimmt. Damit ergibt sich die minimale Intervallgro¨ße minIntSize entsprechend minIntSize = |max−min| 2·α minsupp·(K−1) . (2) Beim spa¨teren Eintreffen von Transaktionen mit Werten, welche das rechts- bzw. linkseitige Extrema erweitern muss das minimale Intervall entsprechend angepasst werden. Zusa¨tzlich zu den Knoten mu¨ssen auch die in den Kno- ten eingebetteten Fenster angepasst werden. Jeder der bei- den durch den Split neu entstandenen Knoten erha¨lt hier- zu die Ha¨lfte der Ha¨ufigkeitswerte des Original-Items. Um spa¨ter beim FP 2-Growth auf einfache Weise das bisherige Split-Verhalten von Items nachvollziehen zu ko¨nnen wird bei einem Split in jedem Zeitfensterslot ein Verweis auf das Fenster angelegt, das bei dem Split abgeteilt wurde. Um weitere Splits bzw. Merges mo¨glichst effizient gestal- ten zu ko¨nnen verweist das letzte Fenster in der Liste auf das erste Listenelement, so dass ein Ring entsteht. Der Ring ermo¨glicht anschließend eine schnelle Suche auch ohne die Verwendung einer doppelt verketteten Liste. 4.4 Item-Merge Items, welche nahezu die gleiche Informationsdichte auf- weisen, ko¨nnen ohne weiteren Informationsverlust kombi- niert (generalisiert) werden. Die Generalisierung von Items fu¨hrt somit zu einer kompakteren Datenstruktur. Zwei Items im FP 2-Stream ai(l, r) und ai(l′, r′) ko¨nnen genau dann zu dem Item ai(l, r′) zusammenge- fasst werden, wenn fu¨r alle Knoten die diese Items re- pra¨sentieren folgenden Bedingungen erfu¨llt werden: 1. Die Items ai(l, r) und ai(l′, r′) sind direkte Nachbarn, d.h. es gilt r = l′. 2. Benachbarte Knoten, welche die Items ai(l, r) und ai(l ′, r′) repra¨sentieren besitzen den gleichen Pra¨fix, d.h. sie verfu¨gen u¨ber den gleichen Elternknoten im FP 2-Stream. 3. density(ai(l, r)) ∼ density(ai(l′, r′)) 1. stellt sicher, dass es sich bei den Verbundkandidaten um Intervallnachbarn handelt. Ausschließlich direkte Nach- barn ko¨nnen miteinander verbunden werden. 2. garantiert, dass alle von dem Item-Merge betroffenen Itemsets eben- falls verbunden werden ko¨nnen. 3. stellt sicher, dass die zu verbindenden Items u¨ber nahezu den gleichen Informati- onsgehalt verfu¨gen, d.h. das es durch die Generalisierung zu keinem Informationsverlust kommt. Algorithmus 1 fu¨hrt den Merge umgehend nach dem Split aus. Wurde ein Split auf einem Item durchgefu¨hrt, dann verfu¨gen die beiden resultierenden Items u¨ber die gleiche Dichte. Erst durch das Einfu¨gen neuer Batches setzt sich die Schiefe, welche zuvor fu¨r den Split notwendiger- weise festgestellt wurde, auch in den neuen Items durch. Aus diesem Grund ist eine sofortige Rekombination zuvor erst geteilter Items fu¨r die na¨chsten k Batches nicht sinn- voll. 4.5 Prunning und Reorganisation Durch das Splitten von Items wird der FP 2-Stream kon- tinuierlich vergro¨ßert. So resultiert zum Beispiel der Split des Wurzelknotens in einer Verdoppelung aller Knoten im Baum. 
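Die minimale Intervallgröße aus Gleichung (2) lässt sich entsprechend als kleine Hilfsfunktion festhalten (Skizze):

```python
def min_intervallgroesse(min_wert, max_wert, alpha, minsupp, K):
    """Minimale Intervallgröße nach Gleichung (2): |max - min| geteilt durch
    die maximale Anzahl an Basisintervallen 2*alpha / (minsupp*(K-1)) nach [11].
    alpha: Anzahl quantitativer Attribute, K: Partial-Completeness-Level."""
    anzahl_intervalle = 2.0 * alpha / (minsupp * (K - 1))
    return abs(max_wert - min_wert) / anzahl_intervalle
```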
Um diesen dennoch mo¨glichst klein und damit die Verarbeitung effizient zu gestalten werden zwei verschie- dene Ansa¨tze verfolgt: • (i) Itemsets, welche nicht mehr ha¨ufig sind und in den na¨chsten Zeitschritten auch nicht mehr ha¨ufig werden ko¨nnen, werden aus der Datenstruktur entfernt wer- den. • (ii) Es erfolgt eine Reorganisation der Baumstruktur um eine mo¨glichst hohe Kompaktheit zu gewa¨hren (a¨hnlich dem CPT [12]). Item-Rekonstruktion und Itemset Exktraktion Zwar werden im FP 2-Stream ha¨ufige Itemsets in einer kompakten Darstellung gespeichert, es wird aber nicht ga- rantiert, dass die Extraktion der ha¨ufigen Muster effizient geschieht. Das Herauslo¨sen der Frequent Pattern stellt da- bei ein kombinatorisches Problem dar. Mit der Entwick- 109 lung des FP-Tee pra¨sentierten Han et al. in [5] den FP- Growth, ein Verfahren zur Extraktion aller im FP-Tree ge- speicherten ha¨ufigen Muster. Im Folgenden wird ein fu¨r den FP 2-Stream angepasstes Growth-Verfahren zum ex- trahieren der ha¨ufigen Muster beschrieben. Bevor die ha¨ufigen Itemsets extrahiert werden ko¨nnen sind zwei wesentliche Verarbeitungsschritte notwendig. In einem ersten Schritt wird aus dem FP 2-Stream ein FP 2- Tree extrahiert. Hierzu wird die Summe u¨ber alle in dem Fenster eines Knotens enthaltenen Ha¨ufigkeiten als aktu- eller Wert fu¨r einen Knoten des FP 2-Tree herangezogen. Aufgrund des kontinuierlichen Teilens der Knoten, unter Annahme einer Gleichverteilung der einzelnen Werte in den Intervallen, handelt es sich bei den Ha¨ufigkeiten in den Knoten des FP 2-Stream zumeist nur um approximier- te Werte. Lediglich Knoten, welche zuvor nicht gesplit- tet wurden, weißen genaue Ha¨ufigkeitswerte auf. Um dies auch fu¨r zuvor geteilte Knoten zu erreichen sind die Fens- terslots aller Knoten, welche urspru¨nglich von dem sel- ben Knoten abstammen untereinander u¨ber Ringlisten mit- einander verbunden (siehe Item-Split). Unter Verwendung dieser kann die genaue Ha¨ufigkeit fu¨r den urspru¨nglichen Knoten wiederhergestellt werden (die Summe u¨ber alle Knoten einer Ringliste). Der durch den Extraktionsprozess erzeugte FP 2-Tree entha¨lt im allgemeinen quantitative Items, deren Support nicht minsupp genu¨gt. Im na¨chsten Schritt werden durch die Rekombination von benachbarten Intervallen Items er- zeugt, welche minsupp genu¨gen. DIn der Literatur las- sen sich verschiedene Interessantheitsmaße fu¨r das Erzeu- gen der Items finden [10; 13]. Beim FP 2-Stream bzw. FP 2-Tree findet die Informationsdichte der Items Verwen- dung. D.h. das Item mit der ho¨chsten Informationsdichte wird so lange mit benachbarten Items (dem jeweils dich- testen direkt benachbarten Item) verbunden, bis der gefor- derte minimale Support erreicht wird. Werden per Defini- tion mehr als ein ha¨ufiges Item pro Datenstromattribut ge- fordert, dann wird dieses Verfahren auf die verbliebenen nicht-ha¨ufigen Items angewandt. Nachdem der FP 2-Tree erzeugt wurde und die Items den definierten Bedingungen genu¨gen, ko¨nnen aus diesem ha¨ufige Itemsets extrahiert werden. Aus Gru¨nden der effizi- enten Auswertung wird hierzu auf das Top-Down Verfahren nach [14] zuru¨ckgegriffen. 5 Evaluierung Im folgenden soll die Funktionsweise des vorgestellten Verfahrens gezeigt werden. Zu diesem Zweck wurde das Verfahren in Form eines Operators in das Datenstrom- Managementsystem AnduIN [6] integriert. 
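Die drei oben genannten Merge-Bedingungen aus Abschnitt 4.4 lassen sich vereinfacht wie folgt prüfen (Skizze; die Dichte-Toleranz eps sowie die Felder parent und density der Knoten sind frei gewählte Annahmen):

```python
def merge_moeglich(item_a, item_b, knoten_a, knoten_b, eps=0.1):
    """Prüft vereinfacht, ob zwei Items a_i(l, r] und a_i(l', r'] zum Item
    a_i(l, r'] zusammengefasst werden können (Bedingungen 1-3, Abschnitt 4.4).

    item_a/item_b: (l, r)-Tupel; knoten_a/knoten_b: Listen der Baumknoten,
    die das jeweilige Item repräsentieren."""
    (l, r), (l2, r2) = item_a, item_b
    # 1. direkte Intervallnachbarn: r = l'
    if r != l2:
        return False
    # 2. gleicher Präfix: korrespondierende Knoten haben denselben Elternknoten
    if {id(k.parent) for k in knoten_a} != {id(k.parent) for k in knoten_b}:
        return False
    # 3. annähernd gleiche Informationsdichte
    d_a = sum(k.density for k in knoten_a)
    d_b = sum(k.density for k in knoten_b)
    return abs(d_a - d_b) <= eps * max(d_a, d_b)
```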
Neben einer Vielzahl an einfachen Operationen (Filter, Projektion, Ver- bund, Aggregation) unterstu¨tzt AnduIN zusa¨tzlich auch die Integration komplexer Operatoren in Form von Synopsen- Operatoren. Unter einem Synopsen-Operator wird dabei ein Operator verstanden, welcher sowohl u¨ber eine In- Memory Datenzusammenfassung, als auch u¨ber notwen- dige Algorithmen fu¨r das Einfu¨gen von Daten und deren Auswertung verfu¨gt. Das hier pra¨sentierte Verfahren wur- de als ein solcher Synopsenoperator integriert. Zuna¨chst soll die grundlegende Funktion des Verfahrens pra¨sentiert werden. Hierzu wurden 3 Datenstro¨me zu je 5000 Datenpunkten erzeugt. Ein Datenstrom entspricht da- bei den Daten eines Attributes, d.h. die Analyse erfolgt im weiteren u¨ber Attributen. Die erzeugten Werte sind dabei standardnormalverteilt. Abbildung 5(a) zeigt exemplarisch den erzeugten Datenstrom fu¨r ein Attribut. Aus diesen drei Datenstro¨men sollen nun zusam- menha¨ngende Itemsets extrahiert werden, wobei pro At- tribut ein Item erzeugt werden soll. Im ersten Test sol- len Items bzw. Itemsets extrahiert werden, deren Support minsupp mindesten 0.25 ist. Es wurde mit einer Batch- gro¨ße |b| = 100, einer Fenstergro¨ße N von 5, sowie Histo- grammen mit 10 Buckets d und einer maximalen Schiefe maxskew von 0.2 getestet. Abbildung 5(b) zeigt beispiel- haft die zeitliche Entwicklung des u¨ber dem Datenstrom aus Abbildung 5(a) entwickelten Items. Graue Buckets repra¨sentieren Items (pro Zeitschritt jeweils eine zusam- menha¨ngende Box). Die schwarzen Rahmen entsprechen den Knoten im FP 2-Stream, die u¨ber diesem Attribut existieren. Weiß gerahmte Buckets sind somit Knoten im FP 2-Stream, welche nicht ha¨ufig sind. Graue Balken, die aus mehreren Buckets bestehen sind im FP 2-Stream u¨ber mehrere Knoten verteilt. Abbildung 5(b) zeigt sehr gut, dass zu Beginn der Analy- se der vollsta¨ndige Wertebereich durch das Item u¨berdeckt wird. Anschließend fa¨ngt der Algorithmus an, den alles u¨berdeckenden Knoten in mehrere Teilknoten zu zerlegen. Aufgrund des Zeitfensters von 5 Batches setzt sich diese Verfeinerung allerdings erst nach 600 Zeiteinheiten durch. Anschließend schwankt das erzeugte Item um den Mittel- wert der Standardnormalverteilten Daten. Elementar fu¨r die Menge an Knoten im FP 2-Stream und fu¨r die Konstruktion der Items bzw. Itemsets ist das Maß der Informationsdichte. Abbildung 5(c) zeigt fu¨r den betrachteten Beispieldatenstrom die Dichte der einzelnen Buckets. Die Bucketgrenzen entsprechen denen aus Abbil- dung 5(b). Die Abbildung zeigt die mit steigender Knoten- verfeinerung zunehmende Informationsdichte. Die Synopse des FP 2-Stream ist eine Datenstruktur zur effizienten Speicherung von Transaktionen. Das kontinu- ierliche Verfeinern von Knoten hat allerdings starken Ein- fluss auf die Gro¨ße der Baumstruktur. So fu¨hrt zum Bei- spiel ein Splitt in der Wurzel zu einer Verdoppelung aller abha¨ngigen Knoten. Wa¨hrend der initialen Phase muss sich das Verfahren erst an die Daten anpassen. Dies kann eine Vielzahl an Split und Merge Operationen zur Folge haben. In Abbildung 5(d) ist die Zeitliche Entwicklung zu obigen Beispiel dargestellt. Sehr gut zu erkennen ist die Spitze zu Beginn des Verarbeitungsprozesses. Abbildung 5(d) zeigt außerdem die Anzahl an Knoten pro Attribut im Baum. Erwartungsgema¨ß steigt diese Zahl mit der Tiefe des Baumes, so dass das Attribut, welches sich auf Blattebene befindet (im Beispiel Attribut 3) durch die ho¨chste Anzahl an Knoten repra¨sentiert wird. 
Durch das Rekombinieren von Items innerhalb desFP 2- Stream werden automatisch auch die entsprechenden Item- sets zusammengefu¨hrt. Wa¨hrend der Itemset-Exktraktion werden alle Items aus der Datenstruktur herausgelo¨st, wel- che dem minimalen Support genu¨gen. Im Beispiel der standardnormalverteilten Daten sind hierbei zusa¨tzlich 2- Itemsets bzw. vereinzelt auch 3-Itemsets extrahiert wurden. Die Bilder in Abbildung 6 zeigen die zeitliche Entwicklung des 2-Itemsets u¨ber den Attributen 0 und 2. Jedes der ein- zelnen Bilder entspricht dabei dem am Ende eines Batches extrahiertem Itemset. Wiederum sehr gut zu erkennen ist die initiale Phase. Außerdem sehr gut zu erkennen ist das entfernen von Buckets am Rand infolge zu weniger Wer- te in diesen Regionen. Das entfernen von uninteressanten Knoten ist einer der Gru¨nde, warum sich die Anzahl an Knoten im FP 2-Stream nach dem initialen Anstieg auf ei- 110 -4 -3 -2 -1 0 1 2 3 4 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit (a) Daten -4 -3 -2 -1 0 1 2 3 4 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit (b) Item -4 -3 -2 -1 0 1 2 3 4 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit (c) Dichteverteilung 0 200 400 600 800 1000 1200 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 An za hl K no te n Zeit Anzahl Knoten gesamt Knoten Attr 1 Knoten Attr 2 Knoten Attr 3 (d) Anzahl Knoten Abbildung 5: Standardnormalverteilte Daten -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 100 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 200 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 500 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 600 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 700 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 800 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 900 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 At trI D 2 AttrID 0 Zeit 1000 Abbildung 6: Entwicklung des Itemsets s1 und s2 nem deutlich niedrigeren Level einpendelt. Fu¨r den na¨chsten Test wurden wiederum drei Daten- stro¨me erzeugt. Jeder Datenstrom umfasst dabei 5000 Da- tenpunkte und folgt einem Trend. Zusa¨tzlich wurden die Daten mit normalverteiltem Rauschen u¨berlagert. Abbil- dung 7(a) zeigt den zeitlichen Verlauf der drei Datensa¨tze. Das Experiment wurde wiederum mit |b| = 100, N = 5 und einer maximalen Schiefe maxskew von 0.2 bei 10 Histogrammbuckets durchgefu¨hrt. Der geforderte Support betrug 0.15. Abbildung 7(b) zeigt exemplarisch die Ent- wicklung der Items fu¨r ein Attribut. Die notwendige Ein- schwingphase ist deutlich zu erkennen. Im weiteren Ver- lauf entwickelt sich das Item mit dem Trend, wobei fu¨r den geforderten minimalen Support mehr oder weniger viele Knoten kombiniert werden mu¨ssen. Interessant ist die Ent- wicklung des Items um den Zeitpunkt 1000. Hier sorgen die neu eintreffenden Werte offensichtlich fu¨r ein oszillie- ren zwischen zwei Items, welches sich erst mit dem Teit- punkt 1300 durchsetzen kann. In Abbildung 7(c) ist die Informationsdichte- Entwicklung des entsprechenden Attributes dargestellt und Abbildung 7(d) zeigt zeitliche Entwicklung der Knotenanzahl im FP 2-Stream. Trotz des kontinuierlichen Trends entwickelt sich die Gesamtknotenanzahl anschlie- ßend relativ konstant bei ca. 200 Knoten. Beginnend ab Zeitpunkt 3300 wa¨chst die Knotenanzahl erneut drastisch an. Die Ursache hierfu¨r liegt offensichtlich in einer A¨nderung der Datencharakteristik wa¨hrend dieser Zeit bei Attribut 3. 
Trotz dem, dass es sich um trenbehaftet Daten mit normalverteiltem Rauschen handelt, scheint zwischen den Zeitpunkten 3000 und 3500 eine weitere U¨berlagerung (a¨hnlich einem Burst) aufzutreten. Diese plo¨tzliche Vera¨nderung hat entsprechend Auswirkungen auf die Knoten und deren Dichten (siehe Abbildung 7(c)), welche Attribut 3 repra¨sentieren und fu¨hren vermutlich zu dem Peak. Diese ersten Ergebnisse zeigen, dass das Verfahren des quantitativen FP-Mining fu¨r die Analyse von Datenstro¨men prinzipiell geeignet ist. Es wurde gezeigt, dass das ein- gefu¨hrte Maß der Informationsdichte ist fu¨r Erzeugen in- 111 -5 0 5 10 15 20 25 30 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit Attribut 1 Attribut 2 Attribut 3 (a) Daten 8 10 12 14 16 18 20 22 24 26 28 30 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit (b) Item 8 10 12 14 16 18 20 22 24 26 28 30 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 W er t Zeit (c) Dichteverteilung 0 500 1000 1500 2000 2500 3000 3500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 An za hl K no te n Zeit Anzahl Knoten gesamt Knoten Attr 1 Knoten Attr 2 Knoten Attr 3 (d) Anzahl Knoten Abbildung 7: Trendbehaftete Daten teressanter Itemsets. Erst dieses ermo¨glicht die Analyse u¨ber Datenstro¨men, da es sowohl fu¨r das Teilen und Ver- binden von Intervallen, als auch fu¨r die Rekombination als Schritt der Itemsetextraktion herangezogen wird. Mit dem pra¨sentierten Maß kann auch das Testen aller mo¨glichen In- tervallkombinationen verzichtet werden, was einer der we- sentlichen Gru¨nde fu¨r die Eignung zur Analyse von Daten- stro¨men ist. 6 Zusammenfassung Das FP-Mining u¨ber statischen Daten mit kategorischen Werte za¨hlt heute zu den klassischen Data Mining Verfah- ren, fu¨r welches eine Vielzahl an Lo¨sungen in der Literatur zu finden sind. In der hier vorgestellten Arbeit wurde ein neuer Ansatz fu¨r die Identifikation von ha¨ufigen Mustern u¨ber quantitative Attribute aus Datenstro¨men pra¨sentiert. Das pra¨sentierte Verfahren kombiniert hierzu bekannte Techniken aus dem Data Mining Bereich mit Verfahren der mehrdimensionalen Indexstrukturen. Neben den notwendi- gen Verarbeitungsschritten wurde die prinzipielle Funktion anhand von Beispielen gezeigt. In weiteren Arbeiten soll die Funktion des Verfahrens mit realen Szenarien unter- sucht und evaluiert werden. Das DSMS AnduIN erlaubt die Teilweise Auslagerung von Funktionalita¨t in Wireless Sensor Netzwerke. Eine der zuku¨nftigen Arbeiten ist da- her die Migration des vorgestellten Verfahrens hin zu ei- nem In-Network-Operator von AnduIN mit dem Ziel einer mo¨glichst effizienten Verarbeitung. Literatur [1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining Asso- ciation Rules between Sets of Items in Large Databases. In ’93, Washington, D.C., USA, pages 207–216, 1993. [2] Y. Aumann and Y. Lindell. A statistical theory for quantita- tive association rules. In Journal of Intelligent Information Systems, pages 261–270, 1999. [3] T. Fukuda, Y. Morimoto, Sh. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. In PODS ’96, pages 182–191. ACM, 1996. [4] M. Halatchev and Le Gruenwald. Estimating missing values in related sensor data streams. In COMAD, pages 83–94, 2005. [5] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In 2000, Dallas, USA, pages 1–12, 2000. [6] D. Klan, K. Hose, M. Karnstedt, and K. Sattler. Power- aware data analysis in sensor networks. In ICDE ’10, Long Beach, California, USA, 2010. IEEE, IEEE. 
[7] C. K.-S. Leung and Q. I. Khan. Dstree: A tree structure for the mining of frequent sets from data streams. In ICDM ’06, pages 928–932, Washington, DC, USA, 2006. IEEE Com- puter Society. [8] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In 2002, Hong Kong, China, pages 346–357, 2002. [9] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash- based algorithm for mining association rules. In SIGMOD ’95, pages 175–186, New York, NY, USA, 1995. ACM. [10] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. 1991. [11] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD Rec., 25(2):1–12, 1996. [12] S. K. Tanbeer, Ch. F. Ahmed, B.-S. Jeong, and Y.-K. Lee. Efficient frequent pattern mining over data streams. In CIKM ’08, pages 1447–1448, New York, NY, USA, 2008. ACM. [13] A. Tuzhilin. On subjective measures of interestingness in knowledge discovery. In KDD ’95, pages 275–281. AAAI Press, 1995. [14] K. Wang, L. Tang, J. Han, and J. Liu. Top down fp-growth for association rule mining. In PAKDD ’02, pages 334–340, London, UK, 2002. Springer-Verlag. 112 A Novel Multidimensional Framework for Evaluating Recommender Systems Artus Krohn-Grimberghe Information Systems and Machine Learning Lab University of Hildesheim, Germany artus@ismll.de Alexandros Nanopoulos Information Systems and Machine Learning Lab University of Hildesheim, Germany nanopoulos@ismll.de Lars Schmidt-Thieme Information Systems and Machine Learning Lab University of Hildesheim, Germany schmidt-thieme@ismll.de ABSTRACT The popularity of recommender systems has led to a large variety of their application. This, however, makes their eval- uation a challenging problem, because different and often contrasting criteria are established, such as accuracy, robust- ness, and scalability. In related research, usually only con- densed numeric scores such as RMSE or AUC or F-measure are used for evaluation of an algorithm on a given data set. It is obvious that these scores are insufficient to measure user satisfaction. Focussing on the requirements of business and research users, this work proposes a novel, extensible framework for the evaluation of recommender systems. In order to ease user-driven analysis we have chosen a multidimensional ap- proach. The research framework advocates interactive vi- sual analysis, which allows easy refining and reshaping of queries. Integrated actions such as drill-down or slice/dice, enable the user to assess the performance of recommenda- tions in terms of business criteria such as increase in revenue, accuracy, prediction error, coverage and more. The ability of the proposed framework to comprise an ef- fective way for evaluating recommender systems in a business- user-centric way is shown by experimental results using a research prototype. Keywords Recommender Systems, Recommendation, Multidimensional Analysis, OLAP, Exploratory Data Analysis, Performance Analysis, Data Warehouse 1. INTRODUCTION The popularity of recommender systems has resulted in a large variety of their applications, ranging from presenting personalized web-search results over identifying preferred multimedia content (movies, songs) to discovering friends in social networking sites. 
This broad range of applications, however, makes the evaluation of recommender systems a Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. challenging problem. The reason is the different and often contrasting criteria that are being involved in real-world ap- plications of recommender systems, such as their accuracy, robustness, and scalability. The vast majority of related research usually evaluates recommender system algorithms with condensed numeric scores: root mean square error (RMSE) or mean absolute error (MAE) for rating prediction, or measures usually stem- ming from information retrieval such as precision/recall or F-measure for item prediction. Evidently, although such measures can indicate the performance of algorithms regard- ing some perspectives of recommender systems’ applications, they are insufficient to cover the whole spectrum of aspects involved in most real-world applications. As an alternative approach towards characterizing user experience as a whole, several studies employ user-based evaluations. These stud- ies, though, are usually rather costly, difficult in design and implementation. More importantly, when recommender systems are de- ployed in real-world applications, notably e-commerce, their evaluation should be done by business analysts and not nec- essarily by recommender-system researchers. Thus, the eval- uation should be flexible on testing recommender algorithms according to business analysts’ needs using interactive queries and parameters. What is, therefore, required is to pro- vide support for evaluation of recommender systems’ perfor- mance based on popular online analytical processing (OLAP) operations. Combined with support for visual analysis, ac- tions such as drill-down or slice/dice, allow assessment of the performance of recommendations in terms of business objec- tives. For instance, business analysts may want to examine various performance measures at different levels (e.g., hierar- chies in categories of recommended products), detect trends in time (e.g., elevation of average product rating following a change in the user interface), or segment the customers and identify the recommendation quality with respect to each customer group. Furthermore, the interactive and visual nature of this process allows easy adaptation of the queries according to insights already gained. In this paper, we propose a novel approach to the eval- uation of recommender systems. Based on the aforemen- tioned motivation factors, the proposed methodology builds on multidimensional analysis, allowing the consideration of various aspects important for judging the quality of a rec- ommender system in terms of real-world applications. We describe a way for designing and developing the proposed ex- tensible multidimensional framework, and provide insights 113 into its applications. This enables integration, combination and comparison of both, the presented and additional, mea- sures (metrics). To assess the benefits of the proposed framework, we have implemented a research prototype and now present experi- mental results that demonstrate its effectiveness. 
Our main contributions are summarized as follows: • A flexible multidimensional framework for evaluating recommender systems. • A comprehensive procedure for efficient development of the framework in order to support analysis of both, dataset facets and algorithms’ performance using in- teractive OLAP queries (e.g., drill-down, slice, dice). • The consideration of an extended set of evaluation measures, compared to standards such as the RMSE. • Experimental results with intuitive outcomes based on swift visual analysis. 2. RELATEDWORK For general analysis of recommender systems, Breese [5] and Herlocker et al. [11] provide a comprehensive overview of evaluation measures with the aim of establishing compa- rability between recommender algorithms. Nowadays, the generally employed measures within the prevailing recom- mender tasks are MAE, (R)MSE, precision, recall, and F- measure. In addition further measures including confidence, coverage and diversity related measures are discussed but not yet broadly used. Especially the latter two have at- tracted attention over the last years as it is still not certain whether today’s predictive accuracy or precision and recall related measures correlate directly with interestingness for a system’s end users. As such various authors proposed and argued for new evaluation measures [22, 21, 6]. Ziegler [22] has analyzed the effect of diversity with respect to user sat- isfaction and introduced topic diversification and intra-list similarity as concepts for the recommender system commu- nity. Zhang and Hurley [21] have improved the intra-list similarity and suggested several solution strategies to the di- versity problem. Celma and Herrera [6] have addressed the closely related novelty problem and propose several techni- cal measures for coverage and similarity of item recommen- dation lists. All these important contributions focus on re- porting single aggregate numbers per dataset and algorithm. While our framework can deliver those, too, it goes beyond that by its capability of combining the available measures and, most importantly, dissecting them among one or more dimensions. Analysis of the end users’ response to recommendations and their responses’ correlation with the error measures used in research belongs to the field of Human-Recommender In- teraction. It is best explored by user studies and large scale experiments, but both are very expensive to obtain and thus rarely conducted and rather small in scale. Select studies are [13, 14, 4]. Though in the context of classical information retrieval, Joachims et al [13] have conducted a highly rele- vant study on the biasing effect of the position an item has within a ranked list. In the context of implicit feedback vs. explicit feedback Jones et al [14] have conducted an impor- tant experiment on the preferences of users concerning rec- ommendations generated by unobtrusively collected implicit feedback compared to recommendations based on explicitly stated preferences. Bollen et al. [4] have researched the ef- fect of recommendation list length in combination with rec- ommendation quality on perceived choice satisfaction. They found that for high quality recommendations, longer lists tend to overburden the user with difficult choice decisions. Against the background of those results we believe that for initial research on a dataset, forming an idea, checking if certain effects are present, working on collected data with a framework like the one presented is an acceptable proxy. 
With findings gained in this process, conducting meaningful user studies is an obvious next step. Recent interesting findings with respect to dataset charac- teristics are e.g. the results obtained during the Netflix chal- lenge [3, 17] on user and item base- effects and time-effects in data. When modeled appropriately, they have a noteworthy effect on recommender performance. The long time it took to observe these properties of the dataset might be an indi- cator for the fact that with currently available tools proper analysis of the data at hand is more difficult and tedious than it should be. This motivates the creation of easy-to- use tools enabling thorough analysis of the datasets and the recommender algorithm’s results and presenting results in an easy to consume way for the respective analysts. Notable work regarding the integration of OLAP and rec- ommender systems stems from the research of Adomavicius et al. [2, 1]. They treat the recommender problem setting with its common dimensions of users, items, and rating as in- herently multidimensional. But unlike this work, they focus on the multidimensionality of the generation of recommen- dations and on the recommenders themselves being multi- dimensional entities that can be queried like OLAP cubes (with a specifically derived query language, RQL). In con- trast, our work acknowledges the multidimensional nature of recommender systems, but focusses on their multidimen- sional evaluation. Existing frameworks for recommender systems analysis usually focus on the automatic selection of one recommenda- tion technique over another. E.g., [10] is focussed on an API that allows retrieval and derivation of user satisfaction with respect to the recommenders employed. The AWESOME system by Thor and Rahm [20], the closest approach to that presented here, shares the data warehousing approach, the description of the necessary data preparation (ETL), and the insight of breaking down the measures used for recom- mender performance analysis by appropriate categories. But contrary to the approach presented here, the AWESOME framework is solely focussed on website performance and re- lies on static SQL-generated reports and decision criteria. Furthermore, it incorporates no multidimensional approach and does not aim at simplifying end-user-centric analysis or interactive analysis at all. 3. FRAMEWORK REQUIREMENTS 3.1 The Role of a Multidimensional Model Business analysts expect all data of a recommender sys- tems (information about items, generated recommendations, user preferences, etc.) to be organized around business enti- ties in form of dimensions and measures based on a multidi- mensional model. A multidimensional model enforces struc- ture upon data and expresses relationships between data elements [19]. Such a model, thus, allows business analysts 114 Figure 1: The recommender evaluation framework. The dimensions specified are connected with both fact table groups (dashed boxes in the center) and are thus available in both resulting cubes. End users can connect to the Rating Prediction and Item Rec- ommendation cubes. to investigate all aspects of their recommender system by using the popular OLAP technology [7]. This technology provides powerful analytical capabilities that business an- alysts can query to detect trends, patterns and anomalies within the modeled measures of recommender systems’ per- formance across all involved dimensions. 
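As a minimal illustration of the OLAP-style operations described above (drill-down along a date hierarchy, slicing by a further dimension), the following Python/pandas sketch breaks an error measure down by dimension attributes. It is not part of the paper's framework; the tiny fact table and all column names are invented for illustration.

    # Illustrative only: OLAP-style slicing of recommender evaluation data with pandas.
    # The fact table and its columns (date, genre, rating, prediction) are hypothetical.
    import pandas as pd

    facts = pd.DataFrame({
        "date":       pd.to_datetime(["2000-05-01", "2000-06-12", "2001-02-03", "2002-03-15"]),
        "genre":      ["Drama", "Horror", "Drama", "Sci-Fi"],
        "rating":     [4, 2, 5, 3],
        "prediction": [3.8, 2.9, 4.4, 3.5],
    })
    facts["sq_error"] = (facts["rating"] - facts["prediction"]) ** 2

    # Drill-down: overall RMSE -> RMSE per year -> RMSE per year and month.
    overall = facts["sq_error"].mean() ** 0.5
    by_year = facts.groupby(facts["date"].dt.year)["sq_error"].mean() ** 0.5
    by_year_month = facts.groupby([facts["date"].dt.year, facts["date"].dt.month])["sq_error"].mean() ** 0.5

    # Slice/dice: restrict to one genre, then break the measure down by another dimension.
    drama = facts[facts["genre"] == "Drama"]
    rmse_drama_by_year = drama.groupby(drama["date"].dt.year)["sq_error"].mean() ** 0.5
    print(overall, by_year, by_year_month, rmse_drama_by_year, sep="\n")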
Multidimensional modeling provides comprehensibility for the business analysts by organizing entities and attributes of their recommender systems in a parent-child relationship (1:N in databases terminology), into dimensions that are identified by a set of attributes. For instance, the dimension of recommended items may have as attributes the name of the product, its type, its brand and category, etc. For the business analyst, the attributes of a dimension represent a specific business view on the facts (or key performance indi- cators), which are derived from the intersection entities. The attributes of a dimension can be organized in a hierarchical way. For the example of a dimension about the user of the recommender systems, such a hierarchy can result from the geographic location of the user (e.g., address, city, or coun- try). In a multidimensional model, the measures (sometimes called facts) are based in the center with the dimensions sur- rounding them, which forms the so called star schema that can be easily recognized by the business analysts. The star schema of the proposed framework will be analyzed in the following section. It is important to notice that aggregated scores, such as the RMSE, are naturally supported. Nevertheless, the power of a multidimensional model resides in adding further de- rived measures and the capability of breaking all measures down along the dimensions defined in a very intuitive and highly automated way. 3.2 Core Features Organizing recommender data in a principled way pro- vides automation and tool support. The presented frame- work enables analysis of all common recommender datasets. It supports both rating prediction and item recommenda- tion scenarios. Besides that, data from other application sources can and should be integrated for enriched analysis capabilities. Notable sources are ERP systems, eCommerce systems and experimentation platform systems employing recommender systems. Their integration leverages analysis of the recommender data by the information available within the application (e.g., recommender performance given the respective website layouts) and also analysis of the appli- cation data by recommender information (e.g., revenue by recommender algorithm). Compared to RMSE, MAE, precision, recall, and F-measure, more information can be obtained with this framework as, first, additional measures e.g. for coverage, novelty, diver- sity analysis are easily integrated and thus available for all datasets. Second, all measures are enhanced by the respec- tive ranks, (running) differences, (running) percentages, to- tals, standard deviations and more. While a single numerical score assigned to each recom- mender algorithm’s predictions is crucial for determining winners in challenges or when choosing which algorithm to deploy [8], from an business insight point of view a lot of in- teresting information is forgone this way. Relationships be- tween aspects of the data and their influence on the measure may be hidden. One such may be deteriorating increase in algorithmic performance with respect to an increasing num- ber of rating available per item, another the development of the average rating over the lifetime of an item in the prod- uct catalog. A key capability of this framework is exposing intuitive ways for analyzing the above measures by other measures or related dimensions. From a usability point of view, this framework contributes convenient visual analysis empowering drag-drop analysis and interactive behavior. 
Furthermore, convenient visual presentation of the obtained results is integrated from the start as any standard conforming client can handle it. Man- ual querying is still possible as is extending the capabilities of the framework with custom measures, dimensions, or func- tions and post-processing of received results in other appli- cations. Inspection of the original source data is possible via custom actions which allow the retrieval of the source rows that produced the respective result. Last but not least, ag- gregations allow for very fast analysis of very large datasets, compared to other tools. The following section elaborates on the architecture of the multidimensional model that is used by the proposed frame- work, by providing its dimensions and measures. 4. THE ARCHITECTUREOF THEMULTI- DIMENSIONAL FRAMEWORK Figure 1 gives an overview of the architecture of the frame- work. The source data and the extract-transform-load (ETL) process cleaning it and moving it into the data store are lo- cated at the bottom of the framework. The middle tier stores the collected information in a data warehouse manner regarding facts (dashed boxes in the center) and dimensions (surrounding the facts). The multidimensional cubes (for rating recommendation and item prediction) sitting on top of the data store provide access to an extended set of mea- sures (derived from the facts in the warehouse) that allow automatic navigation along their dimensions and interaction with other measures. 4.1 The Data Flow The data gathered for analysis can be roughly divided into two categories: 115 Core data: consisting of the algorithms’ training data, such as past ratings, purchase transaction information, on- line click streams, audio listening data, ... and the persisted algorithms’ predictions. Increase-insight data: can be used as a means to lever- age the analytic power of the framework. It consists roughly of user master data, item master data, user transactional statistics, and item transactional statis- tics. This data basically captures the metadata and usage statistics data not directly employed by current recommender algorithms (such as demographic data, geographic data, customer performance data. . . ). In case of recommender algorithms employed in produc- tion environments, relational databases housing the trans- actional system (maybe driving an e-commerce system like an ERP system or an online shop) will store rich business master data such as item and user demographic information, lifetime information and more, next to rating information, purchase information, and algorithm predictions. In case of scientific applications, different text files containing e.g. rat- ing information, implicit feedback, and the respective user and item attributes for training and the algorithms’ predic- tions are the traditional source of the data. From the respective source, the master data, the transac- tional data, and the algorithm predictions are cleaned, trans- formed, and subsequently imported into a data warehouse. Referential integrity between the elements is maintained, so that e.g. ratings to items not existing in the system are im- possible. Incongruent data is spotted during insert into the recommender warehouse and presented to the data expert. Inside the framework, the data is logically split into two categories: measures (facts) that form the numeric infor- mation for analysis, and dimensions that form the axes of analysis for the related measures. 
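A minimal sketch of the referential-integrity handling described above, under the assumption of a pandas-based load step (the framework's actual ETL tooling is not shown in the paper); all identifiers and column names are made up.

    # Sketch: predicted ratings that refer to unknown users or items are separated
    # out for inspection instead of being loaded into the warehouse fact table.
    import pandas as pd

    users = pd.DataFrame({"user_id": [1, 2, 3]})
    items = pd.DataFrame({"item_id": [10, 11]})
    predictions = pd.DataFrame({
        "user_id": [1, 2, 4],      # user 4 does not exist in the user dimension
        "item_id": [10, 12, 11],   # item 12 does not exist in the item dimension
        "prediction": [3.5, 4.1, 2.8],
    })

    valid = predictions["user_id"].isin(users["user_id"]) & predictions["item_id"].isin(items["item_id"])
    to_load = predictions[valid]        # goes into the warehouse
    incongruent = predictions[~valid]   # presented to the data expert
    print(to_load)
    print(incongruent)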
In the framework schema (figure 1), the measures are stylized within the dashed boxes. The dimensions surrounding them and are connected to both, the rating prediction and the item recommendation mea- sures. 4.2 The Measures Both groups of measures analyzed by the framework— the measures for item recommendation algorithms and the measures for rating prediction algorithms—can be divided into basic statistical and information retrieval measures. Statistical measures: Among the basic statistical measures are counts and distinct counts, ranks, (running) dif- ferences and (running) percentages of various totals for each dimension table, train ratings, test ratings and predicted ratings; furthermore, averages and their standard deviations for the lifetime analysis, train rat- ings, test ratings, and predicted ratings. Information retrieval measures: Among the information retrieval measures are the popular MAE and (R)MSE for rating prediction, plus user-wise and item-wise ag- gregated precision, recall and F-measure for item pre- diction. Novelty, diversity, and coverage measures are also included as they provide additional insight. Fur- thermore, for comparative analysis, the differences in the measures between any two chosen (groups of) pre- diction methods are supported as additional measures. In case a recommender system and thus this framework is accompanied by a commercial or scientific application, this application usually will have measures of its own. These measures can easily be integrated into the analysis. An ex- ample may be an eCommerce application adding sales mea- sures such as gross revenue to the framework. These external measures can interact with the measures and the dimension of the framework.1 4.3 The Dimensions The dimensions are used for slicing and dicing the selected measures and for drilling down from global aggregates to fine granular values. For our framework, the dimensions depicted in figure 1 are: Date: The Date dimension is one of the core dimensions for temporal analysis. It consists of standard mem- bers such as Year, Quarter, Month, Week, Day and the respective hierarchies made up from those mem- bers. Furthermore, Year-to-date (YTD) and Quar- ter/Month/Week/Day of Year logic provides options such as searching for a Christmas or Academy Awards related effect. Time: The Time dimension offers Hour of Day and Minute of Day/Hour analysis. For international datasets this dimension profits from data being normalized to the time zone of the creator (meaning the user giving the rating). Age: The Age dimension is used for item and user lifetime analysis. Age refers to the relative age of the user or item at the time the rating is given/received or an item from a recommendation list is put into a shopping basket and allows for analysis of trends in relative time (c.f. section 6). User : User and the related dimensions such as UserProfile and UserDemographics allow for analysis by user mas- ter data and by using dynamically derived informa- tion such as activity related attributes. This enables grouping of the users and content generated by them (purchase histories, ratings) by information such as # of ratings or purchases, # of days of activity, gender, geography... Item: Item and the related dimensions such as ItemCate- gory and ItemComponent parallel the user-dimensions. In a movie dataset, the item components could be, e.g., actors, directors, and other credits. 
Prediction Method : The Prediction Method dimension al- lows the OLAP user to investigate the effects of the various classes and types or recommender systems and their respective parameters. Hierarchies, such as Rec- ommender Class, Recommender Type, Recommender Parameters, simplify the navigation of the data. eCommerce: As recommender algorithms usually accom- pany a commercial or scientific application (e.g., eCom- merce) having dimensions of its own, these dimensions can easily be integrated into and be used by our frame- work. 1E.g., the revenue could be split up by year and recommen- dation method, showing the business impact of a recom- mender. 116 Experimentation: In case this framework is used in an experiment-driven scenario [8], such as an online or marketing setting, Experimentation related dimensions should be used. They parallel the PredictionMethod dimension, but are more specific to their usage sce- nario. 5. PROTOTYPE DESCRIPTION This section describes the implementation of a research prototype for the proposed framework. The prototype was implemented using Microsoft SQL Server 2008 [18] and was used later for our performance evaluation. In our evaluation, the prototype considers the Movielens 1m dataset [9], which is a common benchmark for recom- mender systems. It consists of 6.040 users, 3.883 items, and 1.000.209 ratings received over roughly three years. Each user has at least 20 ratings and the metadata supplied for the users is userId, gender, age bucket, occupation, and zip- code. Metadata for the item is movieId, title and genre information. Following a classical data warehouse approach [15, 12], the database tables are divided into dimension and fact tables. The dimension tables generally consist of two kinds of in- formation: static master data and dynamic metadata. The static master data usually originates from an ERP system or another authoritative source and contains e.g. naming infor- mation. The dynamic metadata is derived information in- teresting for evaluation purposes, such as numbers of ratings given or time spent on the system. To allow for an always up to date and rich information at the same time, we fol- low the approach of using base tables for dimension master data and views for dynamic metadata derived through var- ious calculations. Further views then expose the combined information as pseudo table. The tables used in the ware- house of the prototype are Date, Time, Genre (instantiation of Category), Item, ItemGenre (table needed for mapping items and genres), Numbers (a helper table), Occupation, PredictedRatings, PredictedItems, PredictionMethod, Tes- tRatings, TestItems, TrainRatings, TrainItems, and User. The Item and User table are in fact views over the mas- ter data provided with the Movielens dataset and dynamic information gathered from usage data. Further views are SquareError, UserwiseFMeasure, AllRatings, and AgeAnal- ysis. On top of the warehouse prototype, an OLAP cube for rating prediction was created using Microsoft SQL Server Analysis Services. Within this cube, the respective mea- sures were created: counts and sums, and further derived measures such as distinct counts, averages, standard devi- ations, ranks, (running) differences and (running) percent- age. The core measures RMSE and MAE are derived from the error between predicted and actual ratings. 
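The following hedged Python sketch shows how the core measures RMSE and MAE can be derived from the error between predicted and actual ratings, broken down by prediction method. It only illustrates the computation, not the prototype's SQL Server Analysis Services implementation, and all column and method names are assumptions.

    # Sketch: deriving MAE and RMSE per prediction method from test ratings.
    import numpy as np
    import pandas as pd

    test = pd.DataFrame({
        "method":     ["MF", "MF", "GlobalAvg", "GlobalAvg"],
        "rating":     [4, 2, 4, 2],
        "prediction": [3.7, 2.5, 3.6, 3.6],
    })
    test["abs_error"] = (test["rating"] - test["prediction"]).abs()
    test["sq_error"] = (test["rating"] - test["prediction"]) ** 2

    measures = test.groupby("method").agg(
        MAE=("abs_error", "mean"),
        RMSE=("sq_error", lambda e: np.sqrt(e.mean())),
    )
    print(measures)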
The most important OLAP task with respect to framework develop- ment is to define the relationships between the measures and dimensions, as several dimensions are linked multiple times (e.g. the Age dimension is role-playing as it is linked against both item age and user age) or only indirect relationships exist (such as between category and rating the relationship is only established via item). Designing the relationships has to be exercised very carefully, as both correctness of the model and the ability to programmatically navigate dimen- sions and measures (adding them on the report axes, mea- sure field or as filters) depend on this step. Linking mem- bers enables generic dimensions such as Prediction Method A, and Prediction Method B, that can be linked to chosen dimension members. This renders unnecessary the creation of the n(n− 1)/2 possible measures yielding differences be- tween any two prediction methods A and B (for, say, RMSE or F-measure). Furthermore, this approach allows choosing more than one dimension member, e.g. several runs of one algorithm with different parameters, as one linked member for aggregate analysis. Before we go on to the evaluation of our prototype, let us state that our framework describes more than simply a model for designing evaluation frameworks. The prototype serves well as a template for other recommender datasets, too. With nothing changed besides the data load procedure, it can be used directly for, e.g., the other Movielens datasets, the Netflix challenge dataset or the Eachmovie dataset. Ad- ditional data available in those datasets (e.g. the tagging information from the Movielens 10m dataset) are either ig- nored or require an extension of the data warehouse and the multidimensional model (resulting in new analysis possibil- ities). 6. PERFORMANCE EVALUATION In the previous section we have described the implementa- tion of a research prototype of the proposed framework using the Movielens 1m dataset. Building on this prototype, we proceed with presenting a set of results that are obtained by applying it. We have to clarify that the objective of our experimental evaluation is not limited to the comparison of specific rec- ommender algorithms, as it is mostly performed in works that propose such algorithms. Our focus is, instead, on demonstrating the flexibility and easiness with which we can answer important questions for the performance of recom- mendations. It is generally agreed that explicitly modelling the effects describing changes in the rating behavior over the various users (user base-effect), items (item base-effect), and age of the respective item or user (time effects) [3, 16, 17]. For this reason, we choose to demonstrate the ben- efits of the proposed framework by setting our scope on those effects followed by exemplary dissecting the perfor- mance of two widely examined classes of recommender algo- rithms, i.e., collaborative filtering and matrix factorization. We also consider important the exploratory analysis of items and users, which can provide valuable insights for business analysts about factors determining the performance of their recommender systems. 
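To make the comparative measures from the previous section concrete (the generic Prediction Method A / Prediction Method B members), here is an illustrative sketch of an RMSE difference between two methods broken down along one dimension; the data, bucket labels, and method names are invented and do not come from the prototype.

    # Sketch: RMSE per method and bucket, plus the difference between two methods.
    import pandas as pd

    errors = pd.DataFrame({
        "method":   ["MF", "MF", "GlobalAvg", "GlobalAvg"],
        "bucket":   ["0-10", "11-100", "0-10", "11-100"],   # hypothetical item rating-count buckets
        "sq_error": [1.10, 0.70, 1.25, 1.05],               # mean squared errors per cell (made up)
    })

    rmse = errors.pivot_table(index="bucket", columns="method", values="sq_error", aggfunc="mean") ** 0.5
    rmse["GA_minus_MF"] = rmse["GlobalAvg"] - rmse["MF"]    # positive values: the factor model has the lower error
    print(rmse)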
We believe that the results presented in the following demonstrate how easy it is to obtain them by using the proposed framework, which favors its usage in real-world applications, but can also provide valuable conclusions that motivate the usage of the framework for pure research purposes, since it allows for observing and analyzing the performance by combining all related dimensions that are being modeled. All results presented in the remainder of this section could easily be obtained graphically by navigating the presented measures and dimensions using Excel 2007 as a multidimensional client.

6.1 Exploratory Data Analysis
Using the framework, the first step for both a research and a business analytics approach is exploring the data. As an example, the Calendar dimension (Date) is used to slice the average rating measure. Figure 2 presents this as a pivot chart. The sharp slumps noticeable in March and August 2002, together with a general lack of smoothness beyond mid 2001, arouse curiosity and suggest replacing average rating by rating count (figure not shown). Changing from counts to running percentages proves that about 50 percent of the ratings in this dataset are given within the first six months out of nearly three years. Within two more months, 90 percent of the ratings are assigned, roughly seven percent of the data for 50 percent of the time (figure 3).

Figure 2: Average rating by date.
Figure 3: Rating count by date (running percentages).

6.1.1 Item Analysis
The framework allows an easy visualization of the item effect described e.g. in [16], namely that there usually is a systematic variation of the average rating per item. Additionally, other factors can easily be integrated in such an analysis. Figure 4 shows the number of ratings received per item sorted by decreasing average rating. This underlines the need for regularization when using averages, as the movies rated highest only received a vanishing number of ratings.
Moving on the x-axis from single items to rating count buckets containing a roughly equal number of items, a trend of more heavily rated items being rated higher can be observed (figure omitted for space reasons). A possible explanation might be that blockbuster movies accumulate a huge number of generally positive ratings during a short time while the all-time classics earn a slow but steady share of additional coverage. That all-time classics receive higher ratings can nicely be proved with the framework, too.

Figure 4: Item rating count sorted by decreasing average rating.
Figure 5: The all-time classics effect. Ratings tend to increase with the age of the movie at the time the rating is received. Age is measured in time since the first rating recorded (average rating vs. item age in days).
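The bucket-based analyses above (and those behind figures 5 to 8) follow the same recipe: group items into buckets of roughly equal size by a count attribute and aggregate a measure per bucket. A small pandas sketch of this recipe, with synthetic data standing in for the Movielens statistics:

    # Sketch: equal-count bucketing of items by rating count, then average rating per bucket.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    items = pd.DataFrame({
        "rating_count": rng.integers(1, 2000, size=1000),   # synthetic item rating counts
        "avg_rating":   rng.uniform(1, 5, size=1000),       # synthetic average ratings
    })

    # qcut builds buckets containing a roughly equal number of items (cf. the figure captions).
    items["bucket"] = pd.qcut(items["rating_count"], q=10, duplicates="drop")
    per_bucket = items.groupby("bucket", observed=True)["avg_rating"].mean()
    print(per_bucket)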
Consistent with findings during the final phase of the Netflix competition by Koren [17], figure 5 shows a justification for the good results obtained by adding time-variant base effects to recommender algorithms. Besides the all-time classics effect, the blockbuster effect can also be observed (figure 6), showing that items that receive numerous ratings per day on average also have a higher rating.

Figure 6: The blockbuster effect. Increasing average item rating with increasing number of ratings received per day.

Slicing the average rating by Genre shows a variation among the different genres, with Film-Noir being rated best (average rating 4.07, 1.83% of ratings received) and Horror being rated worst (3.21, 7.64%). Of the genres with at least ten percent of the ratings received, Drama scores highest (3.76, 34.45%) and Sci-Fi lowest (3.46, 15.73%). Figure not shown.

Figure 7: The effect of the number of ratings per user on the average rating.

6.1.2 User Analysis
The user effect can be analyzed just as easily as the item effect. Reproducing the analysis explained above on the users, it is interesting to notice that for heavy raters the user rating count effect is inverse to the item rating count effect described above (figure 7): the higher the number of ratings given by a user, the lower his or her average rating. One explanation for this behavior might be that real heavy raters encounter a lot of rather trashy or at least low-quality movies.

6.2 Recommender Model Diagnostics
For algorithm performance comparison, the Movielens 1m ratings were randomly split into two nearly equal-size partitions, one for training (500103 ratings) and one for testing (500104 ratings). Algorithm parameter estimation was conducted on the training samples only; predictions were made solely on the test partition. Exemplarily, a vanilla matrix factorization (20 features, regularization 0.09, learn rate 0.01, 56 iterations, hyperparameters optimized by 5-fold cross-validation) is analyzed.2
For a researcher the general aim will be to improve the overall RMSE or F-measure, depending on the task, as this is usually what wins a challenge or raises the bar on a given dataset. For a business analyst this is not necessarily the case. A business user might be interested in breaking down the algorithm's RMSE over categories or top items or top users, as this may be relevant information from a monetary perspective. The results of the respective queries may well lead to one algorithm being replaced by another on a certain part of the dataset (e.g. a subset of the product hierarchy).
In figure 8, RMSE is plotted vs. item rating count in train. This indicates that more ratings on an item do help factor models. Interpreted the other way around, for a business user this implies that this matrix factorization yields its best performance on the items most crucial to him from a top-sales point of view (though for slow sellers other algorithms might be more helpful). The same trend can be spotted when RMSE is analyzed by user rating count on the training set (figure omitted for space reasons), though the shape of the curve follows a straighter line than for the item train rating count (where it follows more an exponential decay).
The same trend can be spotted when RMSE is analyzed by user rating count on the training set (figure omitted for space reasons), though the shape of the curve follows a straighter line than for the item train rating count (where it follows more an exponential decay). Due to the approach taken in the design of the OLAP 2The matrix factorization yielded an RMSE of 0.8831 given the presented train-test split. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0 - 0 1 - 1 2 - 4 5 - 8 9 - 1 2 1 3 - 1 6 1 7 - 2 1 2 2 - 2 7 2 8 - 3 4 3 5 - 4 3 4 4 - 5 3 5 4 - 6 6 6 7 - 8 0 8 1 - 9 5 9 6 - 1 1 4 1 1 5 - 1 3 6 1 3 7 - 1 6 5 1 6 6 - 1 9 6 1 9 7 - 2 4 1 2 4 2 - 3 0 4 3 0 6 - 3 8 8 3 8 9 - 5 5 1 5 5 2 - 1 7 1 5 R M SE Item Train Rating Count Figure 8: Item rating count effect on a factor model. Buckets created on roughly equal item count. -0.1 -0.05 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0 - 0 1 - 1 2 - 4 5 - 8 9 - 1 2 1 3 - 1 6 1 7 - 2 1 2 2 - 2 7 2 8 - 3 4 3 5 - 4 3 4 4 - 5 3 5 4 - 6 6 6 7 - 8 0 8 1 - 9 5 9 6 - 1 1 4 1 1 5 - 1 3 6 1 3 7 - 1 6 5 1 6 6 - 1 9 6 1 9 7 - 2 4 1 2 4 2 - 3 0 4 3 0 6 - 3 8 8 3 8 9 - 5 5 1 5 5 2 - 1 7 1 5 R M SE D if fe re n ce b et w e e n G A a n d M F Item Rating Count on Train Figure 9: Difference in RMSE between Matrix Fac- torization (MF) and Global Average (GA) vs. rat- ings available per item on the train dataset. cube the number of recommender algorithms comparable as A and B is not limited; neither does it have to be exactly one algorithm being compared with exactly one other, as multiple selection is possible. Furthermore—given the pre- dictions are already in the warehouse—replacing one method by another or grouping several methods as A or B can nicely be achieved by selecting them in the appropriate drop-down list. Exemplarily, the matrix factorization analyzed above is compared to the global average of ratings as baseline rec- ommendation method. Figure 9 reveals that for this factor model more ratings on train do increase the relative per- formance, as expected, up to a point from which the static baseline method will gain back roughly half the lost ground. Investigation of this issue might be interesting for future recommender models. All results presented could be obtained very fast: when judging the time needed to design query and report (chart)— which was on average seconds for construction of the query and making the chart look nice—, and when judging execu- tion time—which was in the sub-second timeframe. 7. CONCLUSIONS We have proposed a novel multidimensional framework for integrating OLAP with the challenging task of evalu- ating recommender systems. We have presented the archi- 119 tecture of the framework as a template and described the implementation of a research prototype. Consistent with the other papers at this workshop, the authors of this work believe that the perceived value of a system largely depends on its user interface. Thus, this work provides an easy to use framework supporting visual analysis. Our evaluation demonstrates, too, some of the elegance of obtaining ob- servations with the proposed framework. Besides showing the validity of findings during the recent Netflix prize on another dataset, we could provide new insights, too. With respect to the recommender performance evaluation and the validity of RMSE as an evaluation metric, it would be inter- esting to see if a significant difference in RMSE concerning the amount of ratings present in the training set would also lead to significant effects in a related user study. 
In our future work, we will consider the extension of our research prototype and develop a web-based implementation that will promote its usage. 8. ACKNOWLEDGMENTS The authors gratefully acknowledge the co-funding of their work through the European Commission FP7 project My- Media (grant agreement no. 215006) and through the Eu- ropean Regional Development Fund project LEFOS (grant agreement no. 80028934). 9. REFERENCES [1] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. Inf. Syst., 23(1):103–145, 2005. [2] G. Adomavicius and A. Tuzhilin. Multidimensional recommender systems: A data warehousing approach. In WELCOM ’01: Proceedings of the Second International Workshop on Electronic Commerce, pages 180–192, London, UK, 2001. Springer-Verlag. [3] R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 95–104, New York, NY, USA, 2007. ACM. [4] D. Bollen, B. P. Knijnenburg, M. C. Willemsen, and M. Graus. Understanding choice overload in recommender systems. In RecSys ’10: Proceedings of the 2010 ACM conference on Recommender systems, New York, NY, USA, 2010. ACM. [5] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In MSR-TR-98-12, pages 43–52. Morgan Kaufmann, 1998. [6] O. Celma and P. Herrera. A new approach to evaluating novel recommendations. In RecSys ’08: Proceedings of the 2008 ACM conference on Recommender systems, pages 179–186, New York, NY, USA, 2008. ACM. [7] E. Codd, S. Codd, and C. Salley. Providing OLAP to user-analysts: An it mandate. Ann Arbor,MI, 1993. [8] T. Crook, B. Frasca, R. Kohavi, and R. Longbotham. Seven pitfalls to avoid when running controlled experiments on the web. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105–1114, New York, NY, USA, 2009. ACM. [9] GroupLens. Movielens data sets. http://www.grouplens.org/node/73. [10] C. Hayes, P. Massa, P. Avesani, and P. Cunningham. An on-line evaluation framework for recommender systems. In In Workshop on Personalization and Recommendation in E-Commerce (Malaga. Springer Verlag, 2002. [11] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, 2004. [12] W. H. Inmon. Building the Data Warehouse. Wiley, 4th ed., 2005. [13] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 154–161, New York, NY, USA, 2005. ACM. [14] N. Jones, P. Pu, and L. Chen. How users perceive and appraise personalized recommendations. In UMAP ’09: Proceedings of the 17th International Conference on User Modeling, Adaptation, and Personalization, pages 461–466, Berlin, Heidelberg, 2009. Springer-Verlag. [15] R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. Wiley, 2nd ed., 2002. [16] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. 
In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434, New York, NY, USA, 2008. ACM. [17] Y. Koren. Collaborative filtering with temporal dynamics. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 447–456, New York, NY, USA, 2009. ACM. [18] Microsoft. Microsoft SQL Server 2008 homepage. http://www.microsoft.com/sqlserver/2008/. [19] J. O’Brien and G. Marakas. Management Information Systems. McGraw-Hill/Irwin, 9th ed., 2009. [20] A. Thor and E. Rahm. Awesome: a data warehouse-based system for adaptive website recommendations. In VLDB ’04: Proceedings of the Thirtieth international conference on Very large data bases, pages 384–395. VLDB Endowment, 2004. [21] M. Zhang and N. Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys ’08: Proceedings of the 2008 ACM conference on Recommender systems, pages 123–130, New York, NY, USA, 2008. ACM Press. [22] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW ’05: Proceedings of the 14th international conference on World Wide Web, pages 22–32, New York, NY, USA, 2005. ACM. 120 An Evaluation of Multilabel Classification for the Automatic Annotation of Texts Eneldo Loza Mencı´a Knowledge Engineering Group Technische Universita¨t Darmstadt eneldo@ke.tu-darmstadt.de Abstract This article presents the formulation of an in- formation extraction as a multilabel classifica- tion problem. This representation allows for exploiting annotation overlappings and correla- tions. The standard multiclass approach is com- pared to different multilabel classification algo- rithms. 1 Introduction In recent years, more and more approaches have appeared that translate the IE task into a classical classification prob- lem, which is nowadays considered the most popular ap- proach. The most common approach is to transform each text position, i.e. usually each text token in the document, into a classification example. This is often called boundary classification or sequential tagging/labeling. The class in- formation of the instance depends on whether the underly- ing text token is a part or boundary of the target annotation or not. Figure 1 shows an example tagged sentence. The to- ken The is marked as the beginning of a noun phrase ([NP) that ends at fox with NP]. The standard approach is to train one separate classifier for each annotation type, i.e. one for noun phrases, one for verb phrases etc. This subproblems can be solved using a multiclass classification algorithm that is trained to predict for each token exactly one class. We can often observe an overlapping of the annotations of the different types in real world applications, such as in chunking, syntactic parsing or ontology based information extraction (OBIE). The token The e.g. is at the same time a determiner and the beginning of a noun phrase, which is indeed linguistically a reasonable coincidence. The tradi- tional approach ignores this co-occurrence and is therefore not able to exploit the additional information, namely that determiner are often also noun phrases beginnings. The approach presented in this work therefore reformu- lates the many individual multiclass problems into only one multilabel classification problem. 
In contrast to multiclass, multilabel classification allows an instance to be associated to several classes, in this context often called labels. This means that a token is now allowed to have simultaneously several classes assigned. The purpose of this representa- tion is twofold: on the one hand we obtain a more natu- ral, compact and consistent representation of the extraction problem. On the other hand, the main purpose of this for- mulation is to allow an underlying multilabel algorithm to exploit relations in the labels such as co-occurrence, exclu- sions and implications and hence improve the prediction quality. This work evaluates several multilabel learning al- gorithms and compares them on a dense annotation dataset. Only few attempts have been made on this subject so far. McDonald et al. [2005] were able to improve ac- curacy in extracting non-contiguous and overlapping seg- ments of different types using an adapted multilabel algo- rithm. However, their algorithm is not directly comparable since it is centered and adapted to sentences whereas the approach presented here is token based and usable with any multilabel algorithm. 2 Preliminaries The transformation of an information extraction task into a classification problem was already sketched in Section 1. The next two paragraphs give a more detailed descrip- tion of the two main processing steps (mainly based on [Loza Mencı´a, 2009]), continued by the introduction of the employed multilabel classification algorithms (a recent overview can be found in [Tsoumakas et al., 2009]). Boundary classification In this work we employ the simple but effective Begin/End approach. The start and the end of each annotation, i.e. only the boundaries, are marked with the tag BEGIN or END, the rest is marked as negative examples with NEG. The bottom two rows in Fig- ure 1 shows for each type DT (determiner), JJ (adjective), NN (noun) etc. the beginning and ending of an annotation.1 In the standard approach, a problem appears when an annotation only includes one token, such as for DT. This would make it necessary to tag a token as BEGIN and END simultaneously. As this would require a multilabel capable underlying classifier, the common approach is to include an additional class UNIQUE which represents both classes at the same time (see also Section 3). We chose a pragmatic way for solving inconsistencies during the reconstruction of the annotations on the test set: we search for the first appearing of an opening tag and con- tinue the extraction until the first matching closing tag is found, the remaining tags are ignored. Feature Generation The boundary classification step generates the class information for each training instance, but up to now these instances are empty. The simplest fea- tures which can be added are the occurrences of the dif- ferent tokens themselves. Since the focus of this work lies on the comparison of the classification algorithms, we use only these simple features and refrain from using sophisti- cated linguistic features. For the same reason we ignore the classification history. 1The negative class is omitted since this is the multilabel rep- resentation. 
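A minimal Python sketch of the two processing steps just described, i.e. Begin/End boundary labels per annotation type and simple token-window features; the toy sentence and annotation spans are assumptions chosen to mirror Figure 1 below.

    # Sketch: multilabel boundary tagging and window features for a token sequence.
    tokens = ["The", "quick", "brown", "fox"]
    # Annotations as (type, start_index, end_index), e.g. one determiner and one noun phrase.
    annotations = [("DT", 0, 0), ("NP", 0, 3)]

    # Multilabel targets: each token may carry several (type, BEGIN/END) labels at once.
    labels = [set() for _ in tokens]
    for ann_type, start, end in annotations:
        labels[start].add((ann_type, "BEGIN"))
        labels[end].add((ann_type, "END"))

    # Window features: the token itself plus its neighbours at offsets -1 and +1.
    def window_features(i, window=1):
        feats = {tokens[i].lower() + "=1"}
        for off in range(-window, window + 1):
            if off != 0 and 0 <= i + off < len(tokens):
                feats.add(f"{off:+d}.{tokens[i + off].lower()}=1")
        return feats

    instances = [(window_features(i), labels[i]) for i in range(len(tokens))]
    for feats, labs in instances:
        print(sorted(feats), sorted(labs))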
Figure 1: Transformation of the example sentence "The quick brown fox jumps over the lazy dog" into a classification problem. Each column shows a token, exemplarily the generated features with a context of one word (e.g. the=1, +1.quick=1, -1.the=1), and the class information of the corresponding generated classification instance. The first row of the class information, 'POS', shows the part-of-speech annotations (e.g. [DT, DT], [JJ, JJ], [NN, NN]); the second shows the syntactic annotations (e.g. [NP ... NP], [VP ... VP]) given to the token. Abbreviations according to [Marcus et al., 1993]. A '[' denotes the BEGIN and ']' the END of the corresponding annotation type.

The token features row in Figure 1 shows the type of windowing we used.

2.1 Multilabel Classification
We represent an instance or text position as a vector x̄ = (x1, . . . , xN) in a feature space X ⊆ {0, 1}^N. In multilabel problems, each instance x̄i is assigned to a set of relevant labels Yi, a subset of the K possible classes Y = {c1, . . . , cK}, in contrast to multiclass problems, where each instance is mapped to exactly one class, i.e. |Yi| = 1.

Binary Relevance In the binary relevance (BR) or one-against-all (OAA) method, a multilabel problem with K possible classes is decomposed into K binary problems. For each subproblem, a binary classifier is trained to predict the relevance of the corresponding class.

QWeighted Calibrated Label Ranking QCLR is a recently proposed, efficient approach for multilabel classification. This algorithm combines three aspects: the pairwise decomposition of multilabel problems, calibrated label ranking for determining the multilabel result, and an adaptation of the QWEIGHTED algorithm for efficient prediction [Loza Mencía et al., 2009]. In the pairwise decomposition approach, one classifier is trained for each pair of classes, i.e., a problem with K different classes is decomposed into K(K−1)/2 smaller subproblems. An example is added to the training set of cu vs. cv if cu is a relevant class and cv is an irrelevant class or vice versa, i.e., (cu, cv) ∈ Yi × Ȳi ∪ Ȳi × Yi with Ȳi = Y \ Yi as negative label set. During classification, each base classifier is queried and the prediction is interpreted as a vote for one of its two classes. The resulting ranking of classes is split into relevant and irrelevant classes via calibration. The QWEIGHTED approach allows reducing the classification costs from quadratic to log-linear time.

Label Powerset In the label powerset approach (LP), a meta multiclass problem is constructed where each appearing label combination Yi is interpreted as one separate class. The meta problem is then solved with a normal multiclass algorithm or with the previously presented decomposition methods. In the worst case, the resulting multiclass problem has up to min(M, 2^K) classes, where M is the number of training examples; however, the number tends to be much smaller due to class correlations.

3 Multilabel Classification for Information Extraction
As already outlined, one of the advantages of multilabel classification is the more natural representation since we do not have to work with tricks. Note e.g.
that in the BEGIN/END/UNIQUE scheme the learning algorithm is forced to learn to distinguish between UNIQUE and BE- GIN resp. END though UNIQUE is actually a subset of these two classes. This makes this tagging scheme espe- cially interesting for the usage of multilabel classification. Furthermore, the traditional methods do not permit to ex- ploit relations between several annotation types since each type is by design necessarily learned separately. We present therefore in the following the standard approach together with the multilabel alternatives. Traditional multiclass approach A multiclass clas- sifier is trained for each annotation type, the BE- GIN/END/UNIQUE/NEG scheme is used. I.e. for each annotation type a ∈ A a classifier is trained with instances mapped to exactly one class c in Ya = {BEGIN, END, UNIQUE, NEG}. The multiclass problem is solved via one-against-all decomposition in our case. Binary Relevance and QCLR The extraction task is transformed into one multilabel problem where each token is assigned to a subset Y of Y = A×{BEGIN, END}, with A as the set of annotation types. Note that it is not neces- sary to include the NEG as the algorithm is able to predict the empty set. In the binary relevance setting, the algorithm is not ex- pected to improve from label co-occurrences since each base classifier is trained separately. However, the pair- wise approach is at least potentially able to detect non co- occurrences, since we train for each pair of classes a base classifier with instances where the one class is positive and the other negative, i.e. the base classifier is trained with cases where both classes are mutually exclusive. There- fore this approach may be able detect that a co-prediction of two classes is wrong for a determined instance, in con- trast to BR, where a base classifier cannot state anything else than relevant or non-relevant. Recently, promising ad- vances were made in enhancing the pairwise approach by the detection and exploitation of present constraints on the labels, such as co-occurrences [Park and Fu¨rnkranz, 2008]. We are currently working on incorporating these ideas. Label Powerset The multilabel problem is re- transformed into a multiclass problem, i.e. the possible classes of a token are in Y = 2A×{BEGIN,END}. Note that for only one annotation type this corresponds to the 122 traditional multiclass approach, since we would obtain c ∈ Y = {{}, {BEGIN}, {END}, {BEGIN,END}}, which corresponds to {BEGIN, END, UNIQUE, NEG}. But for more than one annotation type, co-occurrence and implications can effectively be exploited and detected since these co-occurrences are explicitly used as training information. However, the granularity of this information is limited, since the approach is only able to abstract from the co-occurrence of two classes if there is no other class appearing since this would not generate the desired meta co-occurrence class anymore. 4 Evaluation Since it is difficult to obtain densely annotated (free) cor- pora e.g. from the field of OBIE, we decided to generate our own dense dataset with the help of the Stanford Parser, which returns the syntactic structure of a sentence.2 The result of the parser was considered to be the true and cor- rect labeling of the corpus. The first six (scientific) texts from the Learned category of the Brown Corpus [Francis and Kucera, 1979] were annotated with this tool, taking the first three documents for training and the remaining for test- ing. 
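Before turning to the resulting dataset statistics, the following sketch illustrates the two label transformations being compared in this evaluation (binary relevance and label powerset); the tiny label sets are invented and do not come from the corpus.

    # Sketch: binary relevance turns each label into its own yes/no target,
    # while label powerset maps each distinct label combination to one meta-class.
    label_sets = [
        frozenset({("DT", "BEGIN"), ("DT", "END"), ("NP", "BEGIN")}),
        frozenset(),                      # token without any annotation boundary
        frozenset({("NP", "END")}),
    ]
    all_labels = sorted({lab for ls in label_sets for lab in ls})

    # Binary relevance: one binary target vector per label.
    binary_targets = {lab: [lab in ls for ls in label_sets] for lab in all_labels}

    # Label powerset: each appearing combination becomes one meta-class id.
    combo_ids = {}
    powerset_targets = [combo_ids.setdefault(ls, len(combo_ids)) for ls in label_sets]

    print(binary_targets)
    print(powerset_targets)   # e.g. [0, 1, 2]: three distinct meta-classes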
The resulting training set contains 6790 instances and 48 different annotations types, 7091 instances remained for testing. Since each annotation type leads to two tags denot- ing the BEGIN and the END, we obtain 96 different labels for the multilabel problem. In average there are 3.34 labels associated per token. For the label powerset representation, 334 classes were retrieved. A window size of 5 resulted in less than 7000 different features. We used the well known LibSVM library with linear kernel and standard settings as our base learner [Chang and Lin, 2001]. The results are shown in Table 4. The first observation is that the multilabel approaches (QCLR and BR) seem to slightly outperform the traditional multiclass approach in terms of F1, and that ML has a slight advantage over LP. A closer look reveals that recall and precision highly de- pend on the used transformation approaches. LP seems to boost recall while ML and especially the classical multi- class approach improve precision, always at the expense of the opposite measure. The MC setting appears to generate quite conservative classifiers, since the MC extractor pre- dicts 18% less annotations than ML-QCLR and even 28% less than LP-PC. Note that the absolute values may seem generally low, but remind that these results were produced without lin- guistic or any other intelligent preprocessing. Moreover, only exact annotation matches were taken into account, counting token matches improves the rates to around 70%. 5 Conclusions We have presented the approach of presenting an infor- mation extraction problem as one multilabel classification problem rather than several independent multiclass prob- lems. This view is more natural to the extraction problem and furthermore potentially allows the exploitation of rela- tions and correlations between overlapping annotations. Although all multilabel approaches achieve higher F1 scores in the experiments than the standard approach, a di- rect comparison of both approaches shows up to be diffi- cult, since the traditional approach is focused on precision while the multilabel approaches obtained higher recall to the extend of the precision. Evaluations on more corpora 2http://www-nlp.stanford.edu/software/lex-parser.shtml Algorihm Precision Recall F1 MC 74.21 34.32 46.93 ML-BR 72.49 40.52 51.98 ML-QCLR 71.49 40.18 51.44 LP-BR 59.67 43.46 50.29 LP-PC 65.12 41.73 50.87 Table 1: Prediction quality of the different algorithms based on exact annotation matches, micro-averaged over all annotation types. MC for the traditional multiclass al- gorithm, ML for the multilabel transformation and LP for label powerset. PC denotes the pairwise decomposition ap- proach for multiclass problems. and perhaps with a more extensive also linguistic prepro- cessing are planned in order to obtain a clearer picture. Nevertheless, it has been demonstrated that the formula- tion as multilabel problem is at least comparable, particu- larly considering that the employed algorithms are not es- pecially adapted or designed in order to exploit class cor- relations. Recent advantages in the relatively new field of multilabel classification let us expect substantial improve- ments (cf. [Tsoumakas et al., 2009]). Furthermore, we are currently investigating an adaption of the pairwise ap- proach that benefits from restrictions and constraints on the possible class constellations [Park and Fu¨rnkranz, 2008]. 6 References Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. 
Software available at http:// www.csie.ntu.edu.tw/∼cjlin/libsvm. W. N. Francis and H. Kucera. Brown corpus manual. Techni- cal report, Department of Linguistics, Brown University, Prov- idence, Rhode Island, US, 1979. Eneldo Loza Mencı´a, Sang-Hyeun Park, and Johannes Fu¨rnkranz. Efficient voting prediction for pairwise multilabel classifica- tion. In Proceedings of the 11th European Symposium on Arti- ficial Neural Networks (ESANN-09), 2009. Eneldo Loza Mencı´a. Segmentation of legal documents. In Pro- ceedings of the 12th International Conference on Artificial In- telligence and Law, pages 88–97, 2009. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313– 330, 1993. Ryan T. McDonald, Koby Crammer, and Fernando Pereira. Flex- ible text segmentation with structured multilabel classification. In Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005. Sang-Hyeun Park and Johannes Fu¨rnkranz. Multi-label clas- sification with label constraints. In Proceedings of the ECML/PKDD-08 Workshop on Preference Learning (PL-08), pages 157–171, 2008. Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Dis- covery Handbook. Springer, 2009. 123 Pattern Mining in Sparse Temporal Domains, an Interpolation Approach Work in Progress Christian Po¨litz University of Bonn poelitz@iai.uni-bonn.de Abstract Weblog systems, mobile phone companies or GPS devices collect large amounts of personal- ized data including temporal, positional and tex- tual information. Patterns extracted from such data can give insight in the behavior and mood of people. These patterns are often imprecise due to sparseness in the data. We propose an interpola- tion technique that augments local patterns with elements that seem to be locally unimportant but with global information they are interesting. 1 Introduction Large amounts of data are gathered in social web appli- cations, on mobile phone calls, GPS signals from cars or animals, to name only a few. The companies that possess these data are highly interested in benefiting from the con- tained information. One way to extract such information is to find patterns. By patterns we mean regularities like frequently mentioned topics in weblogs, paths that certain animals take for foraging or main traffic routes. All the data that we investigate include temporal infor- mation like the time a weblog is written, the time stamp of a telephone call or the time of GPS signals. This leads to two characteristics for the patterns we are interested in. First the patterns must be temporally ordered, for instance which topics follow a certain frequent topic in weblogs or frequently consecutively visited places. These patterns are called sequential patterns and several algorithmical ap- proaches exist to solve this task. An overview of different sequential pattern mining methods and some special cases are given by Zhao and Bhowmick in [Q. Zhao, 2003]. The second characteristic is that the patterns have a tem- poral extend and transition times. This means each pattern has a clear beginning time and an end time. All contained subpatterns have also a beginning and an end. From this we indirectly know the time interval of the pattern. 
The transition time is the time interval between two consecu- tive elements of the pattern and can be calculated from the temporal extend of the subpatterns. Possible patterns can be the evolution of topics in blog entries. Such information is useful for many different com- panies. They can directly search for such topic patterns that deal with their products or services. Other patterns could be movements of groups of people in certain areas. These patterns can be used for targeted advertisements by poster campaigns. To know where certain people move could in- fluence the kind of commercial being shown on posters. One big problem by extracting patterns from the data is sparseness. Sparseness means that the data contains large id topics 1 A B C 2 A C 3 A B 4 B C 5 A C 6 A B C [s1, e1] [s2, e2] [s3, e3] time intervals Table 1: A possible development of topics in a weblog. temporal gaps and/or very few data points with an assign- ment to one specific identifier like blog entry author, mo- bile phone device or GPS device. Such weaknesses in the data can badly influence the search for local patterns. As a result we may get patterns with very large temporal in- tervals with no information of the intermediate time. We try to deal with this problem by using global information to interpolate between consecutive pattern elements with a big temporal difference. The global information can be of arbitrary source, additionally or data immanent. In weblogs for instance, ”bloggers” might write entries quite rarely or there are large breaks between entries. This can for example be due to individual behavior or holidays. For these times it might be impossible to retrieve local pat- terns. We try to overcome this problem by using global information of all bloggers in the corresponding times. We simply use the assumption that the hot topics in the time in which we lack local information would also track the attention of the corresponding bloggers. On table 1 we see a possible distribution of topics A, B and C that are discussed by six different bloggers in certain time intervals. We see that two of three persons who write about topic A in time interval [s1, e1] do write later on in time interval [s3, e3] about topic C. This leads to the rule if someone writes about topic A, they also write about topic C in the given times with a confidence of four fifth. For the intermediate time between the end of topic A e1 and the beginning of topic C s3 there is topic B. Only two persons write about topic B in the mean time after writ- ing about topic A and before topic C. The data supports this only by one third. Applying strictly a pattern mining method we could loose this intermediate topic due to the low support. From the data itself there is the possible pattern of writ- ing about topic A in time interval [s1, e1] leads to writ- ing about topic B in [s2, e2]. Further there is the pattern of people writing about topic B in [s2, e2] will later write 125 D C B A Figure 1: Possible movement pattern in geographical space and in time. about topic C in [s3, e3]. Both patterns have a confidence of three fifth. Since topic B was mentioned four times in the time between A and C we augment the previous pattern by: if people write about topic A and C in the corresponding times they will also write about topic B in the intermediate time [s2, e2]. From this example we gain the following knowledge. 
A possible interpolation of a pattern (or augmentation) should only contain elements (topics) with at least a cer- tain amount of appearance. Additionally the interpolated topics should by at least mentioned by some of the persons that form the previous (original) pattern. An other example that concentrates on movement pat- terns of GPS data is shown on figure 1. Four places are vis- ited in different times indicated by the height of the nodes. The GPS signals have unequal dwell times which is shown by different sizes of the nodes. The solid directed line shows a possible pattern among the places extracted from the GPS data. This pattern leads to the rule that if someone was at place A at the specific time they will be later at place D at another specific time. Due to unregularities and annoyances there is a large temporal and spatial gap between place A and D. From the whole data set we know that in the intermediate time the places C and B are frequently visited and some of the people that support the previous patterns have been at this times at these places, but not enough to make this pattern interesting. From this information we augment the pattern by the places C and B. The resulting pattern is shown on figure 1 as dashed directed lines. 2 Pattern Extraction Method The task to solve the above stated problems is a clear pat- tern mining task. An introduction in pattern mining and a survey on several algorithms is described by Goethals in [Goethals, 2003]. First we describe definitions about the patterns we want to find. We include constrains that make sure we retrieve reasonable patterns from data from a temporal domain con- taining additional information like texts from blog entries or GPS coordinates of cars. Generally we define a pattern as P = i1[s1, e1], i2[s2, e2], · · · , in[sn, en]. ij are the basis elements of the patterns, analogue to the items in frequent item set mining. The elements are for instance topics in weblogs, mobile radio cells or clusters of GPS signals. Each element has a starting time sj and an ending time ej . This means at this time interval the topics are frequently used in weblogs, many phone calls were made in radius of a cell or lots of GPS signals are received. Analogue to sequential pattern mining we assume the elements to be temporally ordered, i.e. sk ≤ sl for k < l. We define the elements ij as disjoint subsets of the an- alyzed data set D = {o1, · · · , on}. Each subset contains only data points with a similarity larger than a given thresh- old. The similarity must be defined w.r.t. to the domain of the data and the application. For instance, if the data are blog entries and we are interested in patterns of topics, the similarity could be the semantical alikeness of the words in the entries to predefined or learned topics. In this case each subset would contain all the blog entries that contain words that indicate a certain topic. In case the data are mobile phone calls, the similarity should be the mobile radio cell the phone is connected to (We assume that the mobile phones are only connected to one cell at a time). This means all mobile phone calls in a certain cell are highly similar and are contained in the same subset. For GPS data of cars the similarity could be the geo- graphical distance to certain streets or to concentrations of many other cars (clusters). A grid (even or uneven) division of the traveled area is also possible. Hence all GPS data of cars within a specific grid cell are similar. 
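A minimal sketch of this grid-based variant might look as follows (hypothetical record layout and an arbitrary cell size; the general, threshold-based definition is given below): GPS points falling into the same cell are treated as similar and grouped into one candidate element.

from collections import defaultdict

# Minimal sketch of the grid-cell notion of similarity for GPS points
# (hypothetical record layout (x, y, timestamp, device_id); cell size is a placeholder).

CELL_SIZE = 250.0  # metres per grid cell, an arbitrary choice

def grid_cell(x, y, cell_size=CELL_SIZE):
    """All points falling into the same cell are considered similar."""
    return (int(x // cell_size), int(y // cell_size))

def group_into_elements(points):
    """Group data points into candidate elements i_j by grid cell."""
    elements = defaultdict(list)
    for x, y, t, device_id in points:
        elements[grid_cell(x, y)].append((x, y, t, device_id))
    return elements

# Usage: two nearby points share a cell, a distant one does not
pts = [(10.0, 20.0, 1000, "car1"), (30.0, 40.0, 1010, "car2"),
       (900.0, 900.0, 1020, "car3")]
for cell, members in group_into_elements(pts).items():
    print(cell, [m[3] for m in members])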
These are only a few possible implementations of a sim- ilarity. In general we define the elements as described in equation 1 with a similarity measure sim : D×D → < and a user defined (domain and application dependent) thresh- old parameter τ . ij = {o1j , · · · , onj |oij ∈ D, sim(okj , olj ) ≥ τ} (1) Among the elements, patterns are extracted w.r.t. a fre- quency measure. The measure depends on the uniqueness of the data points. For that we assume the data points to be assigned to an identifier idj . Possible identifiers are the persons that wrote blog entries, mobile phones or GPS de- vices of cars. We generally assume the following structure of data points oj = (X, t, id) with X a vector information, t a time stamp and id the identifier. The information X can be the textual content of a blog entry, the cell information of a mobile radio cell or the position of a GPS device at time t. A one element pattern ij [sj , ej ] is frequent when it con- tains more than a predefined threshold nmin many unique identifiers. Additionally all contained data points have a time stamp t ∈ [sj , ej ] and are similar w.r.t. the above mentioned similarity measure sim. A continuation of an n − 1 elements pattern P = i1[s1, e1], · · · , in−1[sn−1, en−1] are the elements inj [snj , enj ] with the following properties: • inj [snj , enj ] contains at least nmin data points with identifiers that are also contained in in−1[sn−1, en−1] • all data points in inj [snj , enj ] with identifier idj are temporally after the data point in in−1[sn−1, en−1] that have the identifier idj too Since the time span of the elements can be very large we partition the temporal information in intervals. These intervals can be periods with many blog entries to a certain topic, times with lots of phone calls or congestion of cars. A calendrical partitioning for instance in days or weeks is also possible. To find such partitions we define an additional temporal similarity on the time information of the data points. All data points ok ∈ ij must have a similarity simt : D × 126 D → < larger than a predefined threshold τt (see equation 2). Together with the previous statements we define the elements ij in equation 2. ij = {o1j , · · · , onj |oij ∈ D, sim(okj , olj ) ≥ τ, simt(okj , olj ) ≥ τt} (2) We include this statement in our pattern definition. For the one element pattern ij [sj , ej ] we have the additional restriction that the contained data points are similar w.r.t. the temporal similarity measure simt. The continuations inj [snj , enj ] have the additional property: • all data points in inj [snj , enj ] with identifier idj that are in in−1[sn−1, en−1] too must be similar w.r.t. to the temporal similarity simt The made definitions can be easily integrated into a se- quential pattern mining method like PrefixSpan. In [Pei et al., 2001] Pei et al. introduce the sequential pattern min- ing algorithm PrefixSpan that generates sequential patterns by growing prefixes of patterns. Such a method extracts all sequential patterns that are supported by at least nmin identifiers. For blog entries this means that at least nmin different persons must have written articles to the topics in the found patterns in the specified times. In case of GPS or telephone data the places in the patterns must be visited in the corresponding time intervals by at least nmin different GPS or telephone devices. 3 Pattern Extraction of Sparse Data In the case of sparse data the pattern extraction becomes more difficult as we described above. 
In this case sparse- ness means that only very few data points share the same identifier while on the other hand there are many different identifiers. The data points with the same identifiers addi- tionally have large differences in their temporal attributes. To cope with these shortcomings we suggest an interpo- lation technique that uses all data points to find descrip- tions of the time between data points with the same iden- tifier. According to a pattern extraction method we in- fer intermediate sequence elements of the patterns. This is done by allowing to add to a pattern sequence P = i1[s1, e1], · · · , in−1[sn−1, en−1] not only elements having enough data points with identifiers also contained in ele- ment in−1[sn−1, en−1] but having enough other data points with a very large similarities. For possible continuations inj [snj , enj ] of a pat- tern P = i1[s1, e1], · · · , in−1[sn−1, en−1] we state that in case the temporal difference |en−1 − snj | ex- ceeds a predefined threshold ∆t and there exists no other inj′ [snj′ , enj′ ] that is temporally between en−1 and snj with the properties for a continuation, we insert between in−1[sn−1, en−1] and inj [snj , enj ] in- terpolated subpatterns P ∗ = i∗1[s ∗ 1, e ∗ 1], · · · , i∗n[s∗n, e∗n] that result in an augmentation of the pattern to: in−1[sn−1, en−1], P ∗, inj [snj , enj ]. By ∆t we try to generate reasonable patterns with no more than this threshold of time between consecutive pat- tern elements. Depending on how much the temporal dif- ference of two such pattern elements extends the threshold we allow to augements the pattern by less supported ele- ments. For that we introduce a new parameter n∗min (see ??. n∗min = ∆t · nmin sn−1 − enj (3) The first element i∗1[s ∗ 1, e ∗ 1] of the pattern P ∗ contains more than a predefined threshold nmin many unique iden- tifiers of which at lest n∗min many are also in inj [snj , enj ]. Additionally all contained data points are temporally be- tween en−1 and snj , further they are similar w.r.t. the above mentioned similarity measures sim and simt. The continuation of an n − 1 elements subpattern P ∗ = i∗1[s ∗ 1, e ∗ 1], · · · , i∗n−1[s∗n−1, e∗n−1] are the elements i∗nj [s ∗ nj , e ∗ nj ] with the following properties: • i∗nj [s∗nj , e∗nj ] contains at least n∗min < nmin data points with identifiers that are also contained in in−1[sn−1, en−1] and in inj [snj , enj ] • all together i∗nj [s∗nj , e∗nj ] contains at least nmin data points with different identifiers • all data points in i∗nj [s∗nj , e∗nj ] with identifier idj are temporally after the data points in in−1[sn−1, en−1] that have the identifier idj too and later than the ones in inj [snj , enj ] • all data points in i∗nj [s∗nj , e∗nj ] must be similar w.r.t. the similarity measure sim • all data points in inj [snj , enj ] with identifier idj that are in in−1[sn−1, en−1] too must be similar w.r.t. to the temporal similarity simt These definitions can be easily integrated in our previ- ously stated pattern extraction algorithm. For a continua- tion of a pattern with large temporal difference as described above we simply apply again a sequential pattern mining method like PrefixSpan, but only to data points that have timestamps in the corresponding time interval. A schematic example of the interpolation and augmenta- tion of patterns of weblogs is shown in table 2 and for GPS movement patterns on figure 2. 
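Complementing the schematic examples, a small code sketch of the gap test and the relaxed support threshold is given below (hypothetical element records; reading the denominator of equation 3 as the temporal gap between the two surrounding pattern elements is an assumption).

# Minimal sketch of when an interpolated subpattern is admitted between two
# consecutive pattern elements (hypothetical element records; the reading of
# equation 3 with the temporal gap in the denominator is an assumption).

def needs_interpolation(end_prev, start_next, delta_t):
    """Interpolate only if the temporal gap exceeds the threshold delta_t."""
    return (start_next - end_prev) > delta_t

def relaxed_support(n_min, delta_t, end_prev, start_next):
    """Relaxed threshold n*_min for interpolated elements (cf. equation 3)."""
    gap = start_next - end_prev
    return max(1, int(delta_t * n_min / gap))

def admissible_intermediate(element_ids, prev_ids, next_ids, n_min, n_star_min):
    """An intermediate element needs enough overall support and enough
    identifiers shared with both surrounding pattern elements."""
    shared = element_ids & prev_ids & next_ids
    return len(element_ids) >= n_min and len(shared) >= n_star_min

# Usage with toy values
n_min, delta_t = 5, 60.0
e_prev, s_next = 100.0, 400.0            # a gap of 300 time units
if needs_interpolation(e_prev, s_next, delta_t):
    n_star = relaxed_support(n_min, delta_t, e_prev, s_next)   # -> 1
    print(admissible_intermediate({"a", "b", "c", "d", "e"},
                                  {"a", "b", "x"}, {"a", "y"}, n_min, n_star))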
As quality for the interpolated elements we use a similar- ity measure siminter based on the differences d(i∗j , i ∗ j+1) between consecutive elements i∗j , i ∗ j+1. We generally as- sume that such elements do not differ too much. This as- sumption is similar to many interpolation techniques that require smooth interpolating functions. For movement data the difference could be the geographical distance and for weblog semantical difference between topics. Finally the similarity of the elements is the corresponding difference divided by their temporal difference s∗j+1−, e∗j (see 4). siminter(i ∗ j , i ∗ j+1) = d(i∗j , i ∗ j+1) s∗j+1 − e∗j (4) 4 Future and Ongoing Work Currently we are applying our proposed interpolation method on several data sets. Experiments on GPS signals show promising results. In this case it is easier to interpret the results since they can be shown on maps and the pat- terns are accustomed movements. The results on a large data set of GPS signals from cars are shown on figure 3. The patterns are extracted among concentrations of cars in Milan in the morning. There is a pattern of 5 cars starting at place A from 4:06 am to 6:52 am reaching place E from 6:10 am to 7:09 am. The corresponding data points that form this pattern have a large temporal and a large spatial difference. W.r.t. the found patterns it seems very likely that cars moving from A to E went there by B, C and/or D. Applying our method for 127 pattern hot topics among weblog entries P · · · in−1 in P∗1 i ∗ 1 i ∗ 2 · · · i∗n−1 i∗n P∗2 i ∗ 1 i ∗ 2 · · · i∗n−1 P∗3 i ∗ 1 i ∗ 2 · · · · · · P ∗ P∗1 · · · in−1 i∗1 i∗2 · · · i∗n−1 i∗n in P ∗ P∗2 · · · in−1 i∗1 i∗2 · · · i∗n−1 in P ∗ P∗3 · · · in−1 i∗1 i∗2 · · · in · · · [sn−1, en−1] [s∗1 , e ∗ 1 ] [s ∗ 2 , e ∗ 2 ] · · · [s∗n−1, e∗n−1] [s∗n, e∗n] [sn, en] time intervals Table 2: A schematic representation of interpolated patterns of topics in weblogs. in . . . i∗1 . . . i∗2 . . . . . . i∗n in−1 Figure 2: A schematic representation of a possible interpo- lation of two consecutive pattern elements with large tem- poral/spatial distance. E D C B A Figure 3: Augemented patterns among concentrations of cars. pattern extraction on sparse data with n∗min = 2 we retrieve additional patterns as shown by the dashed lines on figure 3. There are 2 of the 5 cars from the original pattern that went from A to D to eventually arrive at E. Beside these 2 cars there were several others cars in the intermediate time from 5:05 am to 5:54 am. We can now augment the pattern of moving from A to E by place D. On weblog data an interpolation is harder to validated. We plan to artificially include sparseness in data. We want to analyze how good the augmented patterns are compared to corresponding patterns found without artificial sparse- ness. To achieve this we use retrieved patterns from we- blog that have no sparse elements w.r.t. the statements above. Furthermore we insert sparseness into the data on data points that support the previously found patterns. On different degreess of sparseness we want to analyze how good the augmented patterns can reconstruct the original ones and how much they differ. In conjunction with that, we plan to further investigate the influence of the parameters nmin and n∗min. We hope to find heuristic descriptions of possible settings of these parameters concerning the domain of the data, additional (statistical) information and the application. An other important aspect is the computation time of the pattern extraction and the interpolation. 
We generally as- sume the data to be ordered in time. By this the patterns can be found in a faster way and less main memory is used since we must only consider data with timestamps larger than the last element of the current pattern. Although this means in worst case we have to scan the whole data (O(n)) generally we assume local patterns to be small enough and short in time to be easily placed in main memory. For the interpolation we even expect subpattern with very few ele- ments. That is due to the fact that only a smaller number of data points with timestamps between the corresponding el- ements of the original pattern might exists compared to the whole data set. Different data structures and data sources will be investigated in this context. The propose method seems very promising to find pat- terns among large amounts of data from a temporal domain with additional sparseness. We are planning to show this on many different data sets in future works. 5 Aknowledgement The work has been done within the research project Vi- AMoD Visual Spatiotemporal Pattern Analysis of Move- ment and Event Data, which is funded by DFG Deutsche Forschungsgemeinschaft (German Research Foundation) within the Priority Research Programme Scalable Visual Analytics (SPP 1335). We also thank the anonymous re- viewers for their helpful suggestions. References [Goethals, 2003] B. Goethals. Survey on frequent pattern mining. Technical report, 2003. [Pei et al., 2001] Jian Pei, Jiawei Han, Behzad Mortazavi- asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei chun Hsu. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. pages 215–224, 2001. [Q. Zhao, 2003] S.S. Bhowmick Q. Zhao. Sequential pat- tern mining: a survey. Technical report, Technical Re- port Center for Advanced Information Systems, School of Computer Engineering, Nanyang Technological Uni- versity, Singapore, 2003. 128 SVM Classifier Estimation from Group Probabilities Stefan Ru¨ping Fraunhofer IAIS Schloss Birlinghoven, 53754 St. Augustin, Germany stefan.rueping@iais.fraunhofer.de Abstract A learning problem that has only recently gained attention in the machine learning community is that of learning a classifier from group probabil- ities. It is a learning task that lies somewhere between the well-known tasks of supervised and unsupervised learning, in the sense that for a set of observations we do not know the labels, but for some groups of observations, the frequency distribution of the label is known. This learning problem has important practical applications, for example in privacy-preserving data mining. This paper presents an approach to learn a classifier from group probabilities based on support vec- tor regression and the idea of inverting a classi- fier calibration process. A detailed analysis will show that this new approach outperforms exist- ing approaches1 1 Introduction A learning problem that has only recently gained attention in the machine learning community is that of learning a classifier from group probabilities [Kueck and de Freitas, 2005; Quadrianto et al., 2008; 2009]. It is a learning task that lies somewhere between the well-known tasks of su- pervised and unsupervised learning, in the sense that for a set of observations we do not know the labels, but for some groups of observations, the frequency distribution of the la- bel in the groups is known (see Figure 1). The goal is, from this information alone, to estimate a classifier that works well on the labeled data. 
As noted in [Quadrianto et al., 2009], this learning prob- lem has received surprisingly little attention so far, even though it has many interesting applications. One of the most natural applications comes in analyzing the outcomes of political elections, where the population of all voters in an electoral district is known, but only the total number of votes per party in each district is revealed. However, from an analysis of this data, e.g. the dependence of votes on variables such as income or household types, can show up interesting connections, and may be used to uncover elec- tion fraud when outliers from this model are uncovered. Another interesting application comes from privacy- preserving data mining. In some settings, revealing the label of the observations may imply serious privacy con- cerns. For example, in medical research following the out- break patterns of a new type of influenza virus is an impor- 1This paper also appears in the Proceedings of the 27th Inter- national Conference on Machine Learning (ICML 2010) Figure 1: Classifier estimation from group probabilities tant task, but revealing which patient actually got infected with the new virus may be viewed as information that is confidential between him and his treating physician. How- ever, outbreak frequencies in certain risk groups are usually anonymous data, such that they are not sensitive informa- tion. To give another example, in fraud detection it is com- mon practice to apply machine learning to fraud / non-fraud data. While this seems to be straight-forward, in practice labeling some person as a fraudster has serious legal im- plications, in particular when this data is given to a third person for analysis. Even if it is clear that a person has not paid for some merchandise or service he received, there may be perfectly legal reasons not to do so. Hence, it may only be legally safe to label someone as a fraudster if he was convicted by a court of law. In the end, storing only risk probabilities over small groups of people may be the legally advisable way in these cases. To put it more plainly, the difference between fraud labels for observations and group probabilities in this case translate to the difference between the statements this per- son is a fraudster and the much less aggressive in this group of 5 people the risk probability is 20%. In this paper, we will present an algorithm for learning a classifier from group probabilities, which is based on ideas from support vector regression and classifier calibration. The remainder of this paper is structured as follows: in the following section related work is discussed, before Sec- tion 3 introduces the new algorithm, which will be called Inverse Calibration. Section 4 empirically compares the new algorithm to existing approaches. Finally, Section 5 concludes. 129 2 Related Work In this section, we will first present related work for learn- ing a classifier from group probabilities. We will also present existing work on the related task of estimating con- ditional probabilities from a given classifier, which will be relevant later on. 2.1 Estimation of a Classifier from Group Probabilities The task of estimating a classifier from set probabilities de- scribes the setting, where groups of unlabeled observations are given and the only information about the distribution of the labels comes from the frequencies of the labels in each group. 
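As a small illustration of this setting (a hypothetical construction that mirrors the random partitioning used in the experiments below), such group data can be derived from a labelled sample as follows; the learner then only sees the observations, the index sets and their label frequencies.

import random

# Minimal sketch: turning a labelled sample into the group-probability setting
# (random partition into groups of size k, as in the experimental setup below).

def make_groups(labels, k, seed=0):
    """Return index sets S_1..S_m and their positive-label frequencies p_1..p_m."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    groups, probs = [], []
    for start in range(0, len(idx), k):
        S = idx[start:start + k]
        groups.append(S)
        probs.append(sum(1 for i in S if labels[i] == 1) / len(S))
    return groups, probs

# Usage: the labels y_i themselves are never shown to the learner
y = [1, -1, 1, 1, -1, -1, 1, -1]
S, p = make_groups(y, k=4)
print(S, p)   # two groups of four indices with their label frequencies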
A method for estimating a classifier from group prob- abilities, the Mean Map method, has been proposed in [Quadrianto et al., 2009]. The method is based on mod- eling the conditional class probability p(y|x, θ) using con- ditional exponential models: p(y|x, θ) = exp((Φ(x, y)θ) − g(θ|x)) with a normalizing function g. The parameter θ of the model is estimated by taking the known observation means of the groups and inferring from them and the known class frequencies per group the example means given classes. [Quadrianto et al., 2009] defines the learning problem with a transductive component as well, where the distribu- tion of the labels in the test set is known. However, in this paper we do not assume that this information is known. Algorithmically, the Mean Map method boils down to solving a convex optimization problem. While the method is defined for joint kernels on X × Y , a spe- cial case exists for the case of binary classification where k((x, y), (x′, y′)) = yy′k′(x, x′). Since in this paper we are only interested in binary classification, this variant is used in the experiments. The paper [Quadrianto et al., 2009] also gives a de- tailed overview of other related techniques, such as meth- ods based on kernel density estimation, discriminative sort- ing, or generative models and MCMC [Kueck and de Fre- itas, 2005]. However, it was found that none of these meth- ods can outperform their Mean Map method, and hence they are not investigated in detail in this paper. 2.2 Estimating Conditional Probabilities Given a binary classification task described by an unknown probability distribution P (X,Y ) on an input space X and a set of labels Y = {−1, 1}, a probabilistic classifier is a function fprob : X → [0, 1] that returns an estimate of the conditional class probability, i.e. fprob(x) ≈ P (Y = 1|x). A standard approach to probabilistic classification is to calibrate a numerical classifier. That is, for a numerical classification function cl(x) = sign(fnum(x)) the task is to find an appropriate scaling function σ : R → [0, 1] such that σ(fnum(x)) ≈ P (Y = 1|x) holds. A comparative study by [Niculescu-Mizil and Caruana, 2005] revealed that Platt Calibration [Platt, 1999] and Iso- tonic Regression [Zadrozny and Elkan, 2002] are the most effective probabilistic calibration techniques for a wide range of classifiers. Platt Calibration [Platt, 1999] was originally introduced for scaling Support Vector Machine (SVM) outputs, but has been shown to be efficient for many other numerical decision functions as well [Niculescu-Mizil and Caruana, 2005]. Based on an empirical analysis of the distribution of SVM decision function values, Platt suggests to use a scaling function of the form σPlatt(f(x)) = 1 1 + exp(−Af(x) +B) . The parameters A and B are optimized using gradient de- scent to minimize the cross-entropy error Isotonic Regression [Zadrozny and Elkan, 2002] as- sumes a monotonic dependency between the decision func- tion and the conditional class probabilities and finds a piecewise constant, monotonic scaling function that mini- mizes the quadratic loss by making use of the pair-adjacent violators algorithm [Ayer et al., 1955]. Other probabilistic calibration techniques have also been proposed in the literature, for example Softmax Scaling, Binning, or calibration by Gaussian modeling of the deci- sion function. 
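For concreteness, a minimal sketch of Platt scaling as just described might look as follows; the parameters A and B are fitted here by plain gradient descent on the cross-entropy rather than by Platt's original, numerically safeguarded procedure.

import math

# Minimal sketch of Platt scaling: fit sigma(f) = 1 / (1 + exp(-A*f + B)) by
# gradient descent on the cross-entropy (no line search or regularised targets,
# unlike Platt's original procedure).

def platt_fit(scores, labels, lr=0.01, iters=5000):
    A, B = 1.0, 0.0
    n = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-A * f + B))
            t = 1.0 if y == 1 else 0.0      # target probability
            gA += (p - t) * f               # d(cross-entropy)/dA
            gB += (p - t) * (-1.0)          # d(cross-entropy)/dB
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def platt_predict(f, A, B):
    return 1.0 / (1.0 + math.exp(-A * f + B))

# Usage on toy decision values
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [-1, -1, -1, 1, 1, 1]
A, B = platt_fit(scores, labels)
print(round(platt_predict(1.5, A, B), 3))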
A finding that holds particular significance for our ap- proach is that often even very trivial calibration techniques without an elaborate parameter estimation procedure can produce reasonably good probability estimates [Ru¨ping, 2004]. 3 The Algorithm Problem formulation: Let P (X,Y ) be a fixed, but unknown probability distribution and let (x1, y1), . . . , (xn, yn) ⊂ X × {−1, 1} be drawn i.i.d. from P . Assume we are given m subsets of (x1, . . . , xn), where we identify the k-th subset by the set of its indices Sk = {ik,1, . . . , ik,|Sk|}. Let pk = |{i ∈ Sk : yi = 1}|/|Sk| be the estimate of the conditional class probability P (Y = 1|Sk). The goal is to find a classifier f : X → {−1, 1} with minimal error according to P , given only the x1, . . . , xn, the S1, . . . , Sm and the p1, . . . , pm are known. Inversion of Class Probability Estimation: Our ap- proach is to invert the process of estimating conditional class probabilities from Section 2.2. In conditional class probability estimation a classifier f is trained first, and then a scaling function σ is fitted such that σ(f(x)) is a good es- timate of P (Y = 1|x). We instead start with given probability estimates p, fix a scaling function σ, apply this inverse scaling function and train an SVM to predict the values σ−1(p). This approach is partly motivated by [Ru¨ping, 2004], which shows that even very trivial probabilistic scaling functions without an elaborate parameter fitting procedure can give reasonable estimates of p. In our algorithm, we use the scaling function p = σ(y) = 1 1 + exp(−y) which can be seen as a special case of Platt scaling [Platt, 1999] with A = 1 and B = 0. In particular, we will make use of its inverse y = σ−1(p) = − log(1 p − 1). 130 learn scaling fixed classifier learn classifier scaling fixed Classifier Calibration Classifier Estimation from Probabilities Figure 2: Classifier estimation from group probabilities by inverting the classifier calibration process. To simplify notation, in the following, we use p and y, or pi and yi, interchangeably and always imply p = σ(y). In order to avoid undefined values of y, we clip p to the inter- val [ε, 1− ε], where ε is a parameter defining the minimum required precision of the estimate. A reasonable choice is to take ε = 1/#examples. Our goal is to estimate a linear classification function f(x) = wx + b. In order to classify well, we require it to predict y well, which in turn implies that σ ◦ f is a good estimate of p. However, in our problem we are not given estimates of p for every observation x, but only for sets S of observations. In particular, depending on the construction of S, the optimal class probability estimates of the single observations in S may very much vary around their aver- age p. To circumvent this problem, we only require that f predicts y well on average: ∀i : 1|Si| ∑ j∈Si (wxj + b) ≈ yi. We can now formally define the learning task formally in the spirit of Support Vector Regression [Vapnik, 1998], which will in particular allow a kernelization later. Primal Problem: 1 2 ||w||2 + C m∑ i=1 (ξi + ξ ∗ i ) → min s.t. ∀mi=1 : ξi, ξ∗i ≥ 0 ∀mi=1 : 1 |Si| ∑ j∈Si (wxj + b) ≥ yi − εi − ξi ∀mi=1 : 1 |Si| ∑ j∈Si (wxj + b) ≤ yi + εi + ξ∗i The formulation requires to both minimize the complexity of the model and to try to keep the class probability esti- mate of Si close to pi, where the maximum tolerable error is defined by εi. 
Note that to keep the optimization prob- lem in the form of a quadratic problem, the error is defined with respect to yi and not pi, which is the true target. We will fix this inconsistency in the following. Usually in Support Vector Regression one would use a constant value for the required precision, i.e. ∀i : εi = ε. However, in this case the goal is not to estimate yi but pi, such that instead of setting a precision limit on y we actually require pi − ε ≤ σ(yi) ≤ pi + ε ⇔ pi − ε ≤ 1 1 + exp(−yi) ≤ pi + ε ⇔ − log( 1 pi − ε − 1) ≤ yi ≤ − log( 1 pi + ε − 1) A Taylor expansion of order 1 of the function p 7→ − log( 1p+ε − 1) around the point p = pi yields − log( 1 pi + ε − 1) ≈ − log( 1 pi − 1) + ε pi(1− pi) = yi + ε pi(1− pi) . and hence we set εi = ε pi(1− pi) . Dual Problem: It is straightforward to prove that the pri- mal problem can be efficiently solved in its dual form, which is 1 2 ∑m i,j=1 (αi−α∗i )(αj−α∗j ) |Si||Sj| ∑ i′∈Si,j′∈Sj K(xi′ , xj′ ) + ∑m i=1 (αi(εi − yi) + α∗i (εi + yi))→ min s.t. ∑m i=1(αi − α∗i ) = 0 ∀mi=1 : 0 ≤ αi, α∗i ≤ C. where K is a kernel function. The minimization can be carried out by a standard solver for quadratic optimization problems. In the following, this approach will be called Inverse Calibration. 4 Experiments We compared the new approach, called Inverse Calibra- tion, to learning classifiers from set probabilities empiri- cally on 12 data sets from the UCI machine learning repos- itory [Asuncion and Newman, 2007]. Table 1 lists the data sets that were used. To construct probability examples, we picked different set sizes k and randomly partitioned the original data sets into sets of size k (plus one set of size < k, if necessary). Values of k = 2, 4, 8, 16, 32 and 64 were chosen in the experiments. We performed tests with linear kernels, radial basis kernels with parameters γ = 0.01, 0.1 and 1 and polynomial kernels with degrees 2 and 3. Hence, in total 12 ∗ 6 ∗ 6 = 432 experiments were performed. A 10-fold cross-validation was executed in each experiment. For tuning the parameters of the meth- ods, an internal cross-validation loop was applied in each training phase. As performance measure, we want to use the accuracy of predicting the labels in the test set. That is, we assume that while set probabilities are given in the training examples, the ultimate goal is to induce a classifier that accurately predicts the labels. In order for the analysis to be indepen- dent of the default error rate in each data setsD, we use the accuracy of a method M relative to the accuracy that can 131 Table 1: Datasets used in the Experiments. DATA SET SIZE DIMENSION HEART-C 303 23 PRIMARY-TUMOR 339 24 IONOSPHERE 351 34 COLIC 368 61 VOTE 435 17 SOYBEAN 683 84 CREDIT-A 690 43 BREAST-W 699 10 DIABETES 768 9 VEHICLE 846 19 ANNEAL 898 64 CREDIT-G 1000 60 be achieved by a standard classification SVM that has full access to the labels on the training set as our performance measure of choice: accuracyrel(M,D) := accuracy(M,D) accuracy(full SVM,D) We compared the Inverse Calibration algorithm with the Mean Map method which has been proven to perform su- perior to all competing approaches in [Quadrianto et al., 2009]. In order to find out how much of the performance of the Inverse Calibration method is due to the general prop- erties of SVMs, we compare our method against simpler approaches of applying SVMs to the problem of learning from probabilities. 
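As a point of reference for these variants, the sketch below shows the ingredients they share with Inverse Calibration: clipping, the inverse sigmoid transform and group averaging. Since for a linear function the average prediction over a group equals the prediction on the group mean, an off-the-shelf epsilon-SVR on the group means with the transformed targets approximates the formulation of Section 3; it cannot express the per-group tolerances epsilon_i, so it is an illustration rather than the exact dual above.

import numpy as np
from sklearn.svm import SVR

# Illustration only: transformed targets and group means. A standard epsilon-SVR
# with a linear kernel on the group means mimics the averaged constraints of the
# primal, but uses a single global epsilon instead of the per-group tolerances
# eps_i = eps / (p_i * (1 - p_i)) derived above.

def inverse_sigmoid(p, eps):
    p = np.clip(p, eps, 1.0 - eps)          # clip to [eps, 1 - eps]
    return -np.log(1.0 / p - 1.0)           # y = sigma^{-1}(p)

def fit_from_groups(X, groups, probs, eps=1e-3, C=1.0):
    X = np.asarray(X, dtype=float)
    means = np.array([X[idx].mean(axis=0) for idx in groups])   # one row per set S_k
    targets = inverse_sigmoid(np.asarray(probs, dtype=float), eps)
    model = SVR(kernel="linear", C=C, epsilon=0.1)
    model.fit(means, targets)
    return model

def predict_labels(model, X):
    return np.where(model.predict(np.asarray(X, dtype=float)) >= 0.0, 1, -1)

# Usage with a hypothetical feature matrix and two groups of four observations
X = np.random.RandomState(0).randn(8, 3)
groups, probs = [[0, 1, 2, 3], [4, 5, 6, 7]], [0.75, 0.25]
clf = fit_from_groups(X, groups, probs)
print(predict_labels(clf, X))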
The following trivial variants are in- cluded in our tests: Reg: directly predicting the transformed probabilities of each example using a regression SVM. The same label y was used for each element of a set. RegS: directly predicting the transformed probabilities using a regression SVM, but using only the mean of the vectors in each set as an example (i.e., one exam- ple per set). Class: directly predicting the label of each example, us- ing the label 1 for every example in a set S iff the probability of S is higher than the default probability in the complete data set. ClassS: same as Class, but taking only the mean of the vectors in each set as an example (i.e., one example per set). Table 2 list the number of wins of each method in the columns against each method in the row, and against all other methods in total, over all trials. It can be seen that Inverse Calibration and the Mean Map method are clearly superior to the trivial methods: in a direct comparison, the trivial methods lose against the more advanced ones in at least 85% of all trials. In only 6% of all cases, a trivial method outperforms the other methods. Hence, in the fol- lowing, only those two methods will be compared in more detail. 4.1 Dependency of the Performance on k Figure 3 shows the relative accuracies of the Inverse Cal- ibration versus the Mean Map method over all 432 tests. Both approaches generally achieve high relative accuracies, it can be seen that in most of the trials at least 70% of the accuracy of a classification SVM with full information was achieved. To show up an interesting structure in the results, trials with low values of k, namely k = 2, 4, 8, were plotted as blue triangles, while trials with high values of k, namely k = 16, 32, 64, are plotted as red circles. For low k, i.e. for many sets with few observations per set, both approaches seem to perform roughly similar (see below for detailed sta- tistical tests), and also often not much worse than the stan- dard SVM. On the other hand, for high k, i.e. in a situation with less information, it can be seen that Inverse Calibra- tion frequently outperforms the Mean Map method. 0.6 0.7 0.8 0.9 1.0 1.1 0. 7 0. 8 0. 9 1. 0 Quadrianto et al. n ew a lg or ith m Figure 3: Relative Accuracy of Inverse Calibration (vertical axis) vs. Relative Accuracy of Mean Map (horizontal axis) over all tests. Blue triangles depict test with low values of k, while red circle depict trials with high k. To investigate the dependency on k in detail, Figure 4 shows boxplots of the relative accuracy of Inverse Calibra- tion minus the relative accuracy of the Mean Map method over all trials for each k. It can be seen that while generally Inverse Calibration performs better for all k (mean values are positive), the difference become especially pronounced the larger the k. The same effect can be seen in Figure 5, which plots the actual relative accuracies of the two methods over all k. In addition, the relative accuracy of the best trivial method, classS, is also plotted. Again, while the relative accura- cies decrease with increasing k, the gap between the differ- ent methods widens. 4.2 Dependency of the Performance on the Kernel Finally, we are interested in the effect of the kernel func- tion. Figure 6 shows boxplots of the relative accuracy of the Inverse Calibration method minus the relative accuracy of the Mean Map method over all trials for each kernel. It can be seen that the results are quite stable, with a slightly better performance for the linear kernel. 
However, the RBF kernel with parameter γ = 1 shows a high variance. The explanation of the erratic performance of this kernel can be found in Figure 7, which shows the actual accuracies 132 Table 2: Comparison of methods over 432 trials. Table lists the number of wins of the method in a column against the method in a row. Last row compares the method in column against all other methods. METHOD INV. CALIBRATION MEAN MAP REG. CLASS. REG. SETS CLASS. SETS INVERSE CALIBRATION - 139 8 13 10 33 MEAN MAP 292 - 13 52 15 62 REGRESSION 423 419 - 331 150 369 CLASSIFICATION 416 380 101 - 111 277 REGRESSION OVER SETS 422 417 36 321 - 360 CLASSIFICATION OVER SETS 395 370 58 149 66 - ALL METHODS 268 132 0 6 1 19 2 4 8 16 32 64 − 0. 3 − 0. 2 − 0. 1 0. 0 0. 1 0. 2 0. 3 k D iff e re n ce in re la tiv e a cc u ra cy Figure 4: Boxplot of Difference in Relative Accuracy of Inverse Calibration vs. Mean Map over all k. Higher values imply better performance of Inverse Calibration. of the methods. It can be seen that this kernel on the aver- age shows a worse performance than the other ones, which is possibly due to overfitting the training data. As a con- sequence, random variations have a much higher influence on the performance of the learners in this case. 4.3 Overall Results Table 3 shows a detailed comparison of Inverse Calibration with the Mean Map method over all kernels and all values of k. In total, the Inverse Calibration method outperforms Mean Map in 28 of the experiments, performs equally well on 4 and is worse on another 4 experiments (note that each experiment is a test over 12 data sets). A Wilcoxon signed rank test, as suggested by [Demsar, 2006], confirms the statistical significance of the results. Over all trials, a p-value of 6.91e − 07 is achieved, which confirms that the new Inverse Calibration method outperforms the Mean Map method. Further, the Inverse Calibration method per- forms particularly well for the linear kernel and RBF ker- nels with low γ. Vice versa, it can be seen that the new approach is only significantly outperformed for the RBF kernel with param- eter γ = 1 and k = 2, 4. However, as has been already discussed in Section 4.2, this kernel function exhibits a low performance on the average and hence is not of a particu- 2 4 8 16 32 64 k m e a n r e la tiv e a cc u ra cy 0. 6 0. 7 0. 8 0. 9 1. 0 Figure 5: Relative Accuracy of Inverse Calibration (left bar, red) vs. Mean Map (middle bar, green) vs. Classification on Sets (right bar, blue) over all k. Horizontal lines show averages over all k. larly high importance compared to the other kernels. 5 Conclusions and Future Work Estimating classifiers from group probabilities is an impor- tant learning task with many practical applications. How- ever, it has only recently begun to receive attention in the research community. In this paper, we have presented a new algorithm for estimating a classifier from group probabilities based on support vector regression and inverse classifier calibra- tion. A detailed comparison of the new Inverse Calibra- tion algorithm with the best previously known approach of [Quadrianto et al., 2009] has revealed that the new algo- rithm performs significantly better, in particular in the case of linear kernels and high compression factors, i.e. high number k of observations per group. In all other cases, both approaches have been shown to exhibit comparable perfor- mance. 
Algorithmically, the Inverse Calibration method is a quadratic optimization problem, for which efficient solvers exists, while the Mean Map method has to be opti- mized with more general solvers. While the new method works only for binary classifica- tion, Quadrianto’s approach is also defined for an arbitrary 133 Table 3: Comparison of Inverse Calibration vs. Mean Map for each k and kernel. Table list the number of wins/ties/losses and the p-value of a Wilcoxon signed rank test of the hypothesis that the Inverse Calibration method is better than Mean Map. Results that are significant at the 10% level are printed in bold. K DOT RBF(0.01) RBF(0.1) RBF(1) POLY(2) POLY(3) ALL 2 11/0/1 7/2/3 5/0/7 3/0/9 8/0/4 6/0/6 40/2/30 p = 0.005 p = 0.130 p = 0.782 p = 0.989 p = 0.118 p = 0.535 p = 0.351 4 10/0/2 9/0/3 8/0/4 4/0/8 6/0/6 7/0/5 44/0/28 p = 0.015 p = 0.156 p = 0.143 p = 0.893 p = 0.333 p = 0.465 p = 0.083 8 10/0/2 9/0/3 9/0/3 7/0/5 9/0/3 5/0/7 49/0/23 p = 0.034 p = 0.130 p = 0.143 p = 0.602 p = 0.130 p = 0.688 p = 0.072 16 9/0/3 10/0/2 9/0/3 7/0/5 6/1/5 6/0/6 47/1/24 p = 0.056 p = 0.056 p = 0.070 p = 0.050 p = 0.302 p = 0.512 p = 0.002 32 11/0/1 11/0/1 9/0/3 10/0/2 9/1/2 8/0/4 58/1/13 p = 0.018 p = 0.027 p = 0.079 p = 0.004 p = 0.092 p = 0.235 p < 1e-3 64 10/0/2 9/0/3 9/0/3 10/0/2 8/0/4 9/0/3 55/0/17 p = 0.015 p = 0.143 p = 0.070 p = 0.003 p = 0.087 p = 0.044 p < 1e-3 61/0/11 55/2/15 49/0/23 41/0/31 46/2/24 41/0/31 293/4/135 p < 1e-3 p = 0.027 p = 0.020 p = 0.044 p = 0.037 p = 0.275 p < 1e-3 dot rbf(0.01) rbf(0.1) rbf(1) poly(2) poly(3) − 0. 3 − 0. 2 − 0. 1 0. 0 0. 1 0. 2 0. 3 kernel D iff e re n ce in re la tiv e a cc u ra cy Figure 6: Difference in Relative Accuracy of Inverse Cali- bration vs. Mean Map over all kernels. Higher values imply better performance of Inverse Calibration. number of classes. It would be interesting to see if it is possible to extend the new approach to multiple classes, for example by making use of ideas from algorithms for multiclass SVMs [Duan and Keerthi, 2005]. A practically very interesting direction for future work lies in taking the construction process of groups into ac- count. In this paper, we have taken an i.i.d. assumption, which is reasonable when one does not know otherwise. However, in situations like privacy-preserving data mining, where full information is available to one party, but not the second party that is analyzing the data, both parties could still agree on a process that tries to set up the groups in a way to both guarantees data privacy and allow for an ef- fective classifier estimation under these constraints. Such a process could, for example, be built up upon ideas from active learning [Tong and Koller, 2000]. dot rbf(0.01) rbf(0.1) rbf(1) poly(2) poly(3) kernel m e a n a cc u ra cy 0. 70 0. 75 0. 80 0. 85 0. 90 0. 95 Figure 7: Comparison of the accuracy of Inverse Calibra- tion (left bar, red) vs. Mean Map (right bar, green) over all kernels. Higher values imply better performance of Inverse Calibration. References [Asuncion and Newman, 2007] A. Asuncion and D.J. Newman. UCI machine learning repository. http:// www.ics.uci.edu/ ˜ mlearn/, institution = Uni- versity of California, Irvine, School of Information and Computer Sciences, 2007. [Ayer et al., 1955] M. Ayer, H. Brunk, G. Ewing, W. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 5:641–647, 1955. [Demsar, 2006] Janez Demsar. Statistical comparisons of classifiers over multiple data sets. 
Journal of Machine Learning Research, 7:1–30, 2006. [Duan and Keerthi, 2005] Kai-Bo Duan and S. Sathiya Keerthi. Which is the best multiclass svm method? an 134 empirical study. In Nikunj C. Oza, Robi Polikar, Josef Kittler, and Fabio Roli, editors, Proc. Multiple Classi- fier Systems (MCS 2005), volume 3541 of LNCS, pages 278–285. Springer, 2005. [Kueck and de Freitas, 2005] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In Un- certainty in Artificial Intelligence (UAI), page 332339, Arlington, Virginia, 2005. AUAI Press. [Niculescu-Mizil and Caruana, 2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005. [Platt, 1999] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likeli- hood methods. In A. Smola, P. Bartlett, B. Scho¨lkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999. [Quadrianto et al., 2008] Novi Quadrianto, Alex J. Smola, Tibrio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. In W. Cohen, A. McCallum, , and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), page 776783. Omnipress, 2008. [Quadrianto et al., 2009] Novi Quadrianto, Alex J. Smola, Tibrio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. Journal of Machine Learning Research, 10:2349–2374, Oct 2009. [Ru¨ping, 2004] Stefan Ru¨ping. A simple method for estimating conditional probabilities in SVMs. In A. Abecker, S. Bickel, U. Brefeld, I. Drost, N. Henze, O. Herden, M. Minor, T. Scheffer, L. Stojanovic, and S. Weibelza hl, editors, LWA 2004 - Lernen - Wissensentdeckung - Adaptivita¨t. Humboldt-Universita¨t Berlin, 2004. [Tong and Koller, 2000] S. Tong and D. Koller. Restricted bayes optimal classifiers. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), 2000. [Vapnik, 1998] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998. [Zadrozny and Elkan, 2002] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694–699, 2002. 135 Fast-Ensembles of Minimum Redundancy Feature Selection Benjamin Schowe and Katharina Morik Technische Universita¨t Dortmund {schowe,morik@ls8.cs.tu-dortmund.de} Abstract Finding relevant subspaces in very high- dimensional data is a challenging task not only for microarray data. The selection of features must be stable, but on the other hand learning performance is to be increased. Ensemble meth- ods have succeeded in the increase of stability and classification accuracy, but their runtime prevents them from scaling up to real-world applications. We propose two methods which enhance correlation-based feature selection such that the stability of feature selection comes with little or even no extra runtime. We show the efficiency of the algorithms analytically and empirically on a wide range of datasets. 
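The abstract does not spell out the selection criterion; as a generic illustration of the correlation-based minimum-redundancy family it builds on (not the proposed Fast-Ensemble algorithms themselves), a greedy relevance-minus-redundancy loop with Pearson correlation could look as follows.

import numpy as np

# Generic greedy minimum-redundancy / maximum-relevance selection with Pearson
# correlation (an illustration of the method family the abstract refers to,
# not the Fast-Ensemble algorithms themselves).

def corr(a, b):
    c = np.corrcoef(a, b)[0, 1]
    return 0.0 if np.isnan(c) else abs(c)

def mrmr_select(X, y, n_select):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    relevance = np.array([corr(X[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([corr(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy       # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Usage on a small synthetic matrix: feature 0 drives the label, feature 1 is
# a near-copy of it, feature 2 is noise -> the second pick avoids the copy.
rng = np.random.RandomState(1)
f0 = rng.randn(200)
X = np.column_stack([f0, f0 + 0.01 * rng.randn(200), rng.randn(200)])
y = (f0 > 0).astype(float)
print(mrmr_select(X, y, 2))   # typically [0, 2]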
137 Potenzial des Data Mining fu¨r Ressourcenoptimierung mobiler Gera¨te im Krankenhaus Rene Schult, Bastian Kurbjuhn Otto-von-Guericke-Universita¨t 39106 Magdeburg, Deutschland schult@ovgu.de bastian.kurbjuhn@st.ovgu.de Abstract Computerunterstu¨tztes Ressourcenmanagement in Krankenha¨usern wird, nicht zuletzt durch den Kostendruck, immer wichtiger. Wir be- sprechen, wie mit Data Mining Technologien das Ressourcenmanagement fu¨r mobile medizi- nische Gera¨te in einem Krankenhaus unterstu¨tzt werden kann und welches Potenzial in der Auswertung von Nutzungsdaten fu¨r ein Kranken- haus bestehen. 1 Einleitung In Krankenha¨usern gibt es viele mobile medizinische Gera¨te, die fu¨r die Behandlung am Patienten beno¨tigt werden. In einem Krankenhaus mittlerer Gro¨ßenordnung summiert sich der Anschaffungspreis dieser Gera¨te auf mehrere Hunderttausende von Euro oder gar daru¨ber. Dies stellt einen erheblichen Anteil am gebundenen Kapital eines Krankenhauses dar. Durch den Kostendruck der Krankenha¨user und der Anforderung, mit den eingeset- zten Ressourcen effizienter umzugehen, ergeben sich aus der Sicht des Ressourcenmanagements sowie aus der Lit- eratur heraus u.a. folgende Problemstellungen, die auch beim Facility Management in Krankenha¨usern immer mehr an Bedeutung gewinnen: (i) Einhaltung von gesetzlichen Vorgaben, etwa zur Dokumentation der Nutzung oder zur Wartung der Gera¨te, (ii) Einhaltung von Vorgaben zur Verfu¨gbarkeit der Gera¨te oder zur Qualita¨t der Leis- tung, ([Pocsay and Distler, 2009]; [Mauro et al., 2010]) (iii) Minimierung der Investitionskosten und der laufenden Kosten ([Salfeld et al., 2009]; [Mauro et al., 2010]), (iv) Minimierung des Wartungsaufwands und somit auch der Wartungskosten ([Salfeld et al., 2009]; [Mauro et al., 2010]). Im Folgenden schildern wir anhand eines Fallbeispiels die Aspekte, die beim Ressourcenmanagement von mo- bilen Gera¨ten im Krankenhaus zu beru¨cksichtigen sind, und erla¨utern dabei das Potenzial der Erfassung und Ana- lyse von Daten zur Gera¨tenutzung. Das hier geschilderte Problem bezieht sich derzeit auf die tragbaren Infusions- spritzen und Infusionspumpen, kann aber inhaltlich sicher auf weitere, wenn nicht gar alle mobilen medizinischen Gera¨te u¨bernommen werden. Es besteht die Problematik, dass das technische Personal jedes Gera¨t fu¨r die Erfu¨llung von Wartungsaufgaben oder anderen Servicedienstleistungen finden muss, bzw. wissen muss, ob es derzeit gerade bei einem Patienten im Einsatz ist oder nicht. Die derzeitig gesetzlich vorgeschriebenen Wartungsintervalle fu¨r die oben genannten Gera¨te betra¨gt ein Jahr. Das Problem des Auffindens eines Gera¨tes hat auch das Schwesternpersonal, wenn es ein freies Gera¨t fu¨r einen neuen Patienten sucht. Fu¨r die Gera¨te existiert je Station meist ein zentraler Abstellplatz, wobei der je nach Sta- tion sehr unterschiedlich sein kann. Um nun ein freies Gera¨t fu¨r einen neuen Patienten oder fu¨r die Wartung zu finden, wird derzeit der Lagerraum der Station aufge- sucht. Ist das Gera¨t dort nicht vorhanden, wird die Sta- tion nach dem Gera¨t abgesucht. Dieses Vorgehen ist ge- rade fu¨r das technische Personal sehr zeitintensiv, denn einerseits muss das technische Personal den jeweiligen Ab- stellort der Station kennen und falls das gesuchte Gera¨t dort nicht vorhanden ist, die gesamte Station danach ab- suchen und hoffen, dass es gerade nicht bei einem Patien- ten im Einsatz ist. 
Because of this current procedure and the medical necessity of keeping a sufficient number of mobile medical devices available, the present situation is that far more devices of the different types (in this example, infusion pumps and infusion syringe pumps) are kept available than are actually needed. Considering that each of these devices costs between 1000 and 1800 euros, and looking at the hospital's current figures of 151 infusion pumps and 286 infusion syringe pumps in use, it becomes clear that for these two device types alone more than 550,000 euros of capital are tied up in the inventory.
Our solution approach comprises the following components: (1) locating the mobile medical devices, (2) online acquisition of the devices' movement data by automatically evaluating the location data, (3) derivation of usage-relevant information (events) from the movement data, and (4) analysis of the usage data to compute the actual availability of and the actual demand for every device. On this basis, the optimization, including forecasts, is to take place. Through the suitable use of technical means, locating the mobile medical devices is to be simplified and their usage status is to be captured. Based on this, the usage of the mobile medical devices (infusion pumps and syringe pumps) is then to be analyzed with data mining methods in order to enable a better utilization of the devices and, in a final step, to investigate possibilities of reducing the device inventory. The use of data mining methods is attractive in particular under the constraint that the medical staff can and will change their existing work processes as little as possible, or not at all, for locating the medical devices and determining their usage. These optimization steps should make a more detailed planning of the inventory of medical devices and of their utilization possible, which can, among other things, lead to a reduction of the tied-up capital.

2 Related Work
Resource management (also called facility management) has long been discussed in the literature. Increasingly, computer-aided facility management (CAFM) is being considered ([Nävy, 2006]). The German Facility Management Association (GEFMA) has pointed out in its working group on facility management in hospitals that, among other things, the introduction of DRGs has led hospitals to look for more professional procedures and for optimization potential ([Odin, 2010]). In ([Köchlin, 2004]), Köchlin illustrates the benefit of facility management in hospitals. In ([Salfeld et al., 2009]), Salfeld et al. point out that above all the non-clinical functions, to which facility management belongs, offer cost-saving potential. In ([Pocsay and Distler, 2009]) and ([Mauro et al., 2010]), the authors mainly consider the potential of IT to exploit possible savings and to improve the processes in a hospital. In both cases, however, the focus is more on the exchange of data, which is to be enabled, among other things, by SOA, than on the optimization of processes and resources. Data mining methods were used by Alapont et al. in [Alapont et al., 2004] for forecasting the occupancy of beds in Spanish hospitals.
There, the statistical methods LinearRegression, LeastMedSq, SMOreg, MultilayerPerceptron, KStar, LWL, DecisionStump, the M5P tree learner, and IBk were compared with each other. The methods LinearRegression and M5P were best suited to reduce the forecast error rate. The consideration of portable medical devices as a hospital resource in combination with computer-aided facility management has not yet been addressed in the literature.

3 Foundations
Effective resource management requires data about the usage of the devices under consideration. Especially for mobile devices, obtaining data about their usage and their location is not easy, precisely because the devices can be used anywhere. In our project, localization is carried out with WLAN technology. A WLAN chip is attached to the device to be monitored and reports to a server at regular intervals via different access points. Special software can then determine the location of a device. These positions, combined with a timestamp, are stored in a database. The database structure is shown in Table 1. For every WLAN chip (TagId), Table 1 records the associated parameters such as the X and Y coordinates of the current position and the corresponding mapping parameters. Furthermore, the current timestamp of the entry and the transmission quality of the WLAN chip are stored. The database thus represents a kind of log file for the individual WLAN chips and their current positions. Using data mining techniques for log-file analysis, the data can be grouped and filtered per device. By evaluating this positional data for a specific device together with spatial information about the building, data about the usage of a device can be derived. As a result, data about the position and the usage of every medical device over time become available. From these temporal data, various key figures can be computed for every single device, but also for every ward or for the whole hospital, such as device utilization, idle times, repair times, distances traveled, and so on.

4 Research Questions
The questions of interest from a business intelligence perspective arise from the analysis of these individual time series.
1. It is interesting for the hospital to analyze how high the utilization of the individual medical devices is. Building on this, one can try to analyze where possible differences in the usage levels of the individual wards come from. This means performing a classification such that one can distinguish which properties of the collected data account for a high or a low usage of the devices.
2. From the usage statistics and the identified dependencies for high or low device usage, one can try to derive a forecast of the future demand for the medical devices. Precisely these forecasts can mean savings from an economic point of view, because if the forecasts turn out lower than the current device inventory, fewer devices suffice for the same medical service without any loss of quality. Certainly, a kind of buffer of devices should still be introduced in order, for example, to
absorb upward fluctuations in demand.
To address the first question described above, a descriptive model is derived from the existing data in the database and from the subsequently prepared log data of the individual devices, in order to be able to distinguish between the properties associated with usage and with non-usage. Here it is interesting to find out (1) whether there are typical properties for the usage or non-usage of the devices at all, and (2) with which data mining methods these can best be determined, i.e., how the data can best be partitioned into the defined usage groups. Is, for example, a regression analysis, a clustering algorithm, or a classification model better suited for this? For the second question, it is of strong interest whether high-quality predictions about the usage of the individual devices can be made. Here one can try to distinguish between individual devices, the devices assigned to a ward, or the device groups of the whole hospital. High-quality predictions would be an ideal basis for supporting the hospital's resource management for these medical devices and, at the same time, for saving costs by reducing the capital tied up in these devices.

Table 1: Database structure for the localization

  Field         Type           Constraints                                   Example
  Id            BIGINT         NOT NULL UNSIGNED AUTO INCREMENT PRIMARY KEY  1
  TagId         VARCHAR(20)    NOT NULL                                      3003121670
  MacAdress     VARCHAR(20)                                                  00:00:B3:00:00:06
  PosX          INT(10)        NOT NULL                                      240
  PosY          INT(10)        NOT NULL                                      588
  PosModelId    INT(10)                                                      1
  PosMapId      INT(10)                                                      3
  PosMapName    VARCHAR(100)                                                 floor2
  PosZoneId     INT(10)                                                      17
  PosZoneName   VARCHAR(100)                                                 hall-a
  PosQuality    INT(10)                                                      99
  PosTimestamp  TIMESTAMP      NOT NULL                                      29.05.06 15:21

5 Approach to Implementation
From the information obtained about a device's location, a property (target variable) can be defined that states whether a device is currently in use. For this purpose, however, so-called storage locations have to be determined. According to the database definition, a location is composed of PosMapId (PosMapName) and PosZoneId (PosZoneName); this is how the rooms are represented (see Table 1). If certain rooms are designated as (ward) storage locations, it can be derived from the position of a device whether it is in use at that time. Since maintenance takes place off-site, the devices would not be logged into the hospital's WLAN during repair periods. With continuous logging, the absence of a device from the localization database over a short period of time therefore means that the device is under maintenance or has been removed. It can, however, also mean that there is a technical fault. These events can be distinguished from each other by the duration of the missing logging. Since the average duration of a device inspection is known from past maintenance records (and the inspection follows a standard), an allowed period can be defined during which the device may be absent. If this time limit, together with the corresponding average repair time, is exceeded, the probability increases that there is a fault in the communication between the chip and the access point.
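To make the event derivation described above concrete, the following Python sketch (an illustration by the editors, not part of the project's implementation) derives per-device status episodes from the location log and flags logging gaps as maintenance or as a suspected fault; the field names follow Table 1, while the storage-zone assignment and all thresholds are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical parameters; none of these values are taken from the paper.
STORAGE_ZONES = {("floor2", "hall-a")}     # rooms designated as (ward) storage locations
REPORT_GAP = timedelta(hours=6)            # a gap this long means the device left the WLAN
MAINTENANCE_WINDOW = timedelta(days=3)     # allowed absence for a standard inspection
AVG_REPAIR_TIME = timedelta(days=4)        # average repair duration known from past records

def classify_gap(gap):
    """Distinguish absence events by the duration of the missing logging."""
    if gap <= MAINTENANCE_WINDOW + AVG_REPAIR_TIME:
        return "maintenance_or_removed"
    return "suspected_fault"               # likely a chip/access-point communication problem

def device_episodes(log):
    """log: time-ordered position records (dicts with PosTimestamp, PosMapName, PosZoneName)
    of a single TagId. Returns (start, end, status) episodes."""
    episodes = []
    for prev, curr in zip(log, log[1:]):
        start, end = prev["PosTimestamp"], curr["PosTimestamp"]
        gap = end - start
        if gap > REPORT_GAP:               # device was absent from the localization database
            episodes.append((start, end, classify_gap(gap)))
        elif (prev["PosMapName"], prev["PosZoneName"]) in STORAGE_ZONES:
            episodes.append((start, end, "in_storage"))
        else:
            episodes.append((start, end, "in_use"))
    return episodes

def utilization_rate(episodes, period_start, period_end):
    """Utilization rate = (usage times + repair times) / observation period, as used below."""
    counted = ("in_use", "maintenance_or_removed")
    used = sum((end - start for start, end, status in episodes if status in counted), timedelta())
    return used / (period_end - period_start)

# Toy usage with three log records of one device.
log = [
    {"PosTimestamp": datetime(2006, 5, 29, 15, 21), "PosMapName": "floor2", "PosZoneName": "hall-a"},
    {"PosTimestamp": datetime(2006, 5, 29, 18, 21), "PosMapName": "floor2", "PosZoneName": "ward-3"},
    {"PosTimestamp": datetime(2006, 5, 30, 9, 0), "PosMapName": "floor2", "PosZoneName": "hall-a"},
]
eps = device_episodes(log)
print(eps)
print(utilization_rate(eps, datetime(2006, 5, 29), datetime(2006, 5, 31)))
```

A per-device loop over the grouped log records then yields the key figures mentioned in Section 3, such as utilization, idle times, and repair times.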
In an analysis to determine the degree of usage, the data affected by such technical problems have to be cleaned (i.e., excessively long absences have to be questioned). In combination with the timestamps and the history data, it is then possible to determine the utilization of individual devices. For the utilization rate per device:

utilization rate = (usage times + repair times) / observation period

These steps for computing the utilization rate and for cleaning the data with respect to maintenance periods or technical problems are typical data preparation steps in this optimization project.

5.1 Descriptive Model Concept - Clustering Devices with Similar Usage Behavior
Now that the utilization rates of the individual devices as well as their storage and repair times can be determined from the existing log data in the localization database, it has to be examined which properties characterize the usage or non-usage of the devices. The following factors can influence the utilization:
• the consumer of the device (the ward): At which ward was the device last in use? The length of the stay at a ward indicates which wards are the larger consumers compared to other departments, i.e., which depend on the devices more than other wards do. How often does a ward access a device? This allows determining whether a ward depends on certain devices and is the main user of a device.
• the current storage location (room reference): Which storage location is assigned to the device? If a device has a storage room with no large consumers (wards with a particularly high demand for mobile devices) in its vicinity, the probability that it is borrowed (used) is lower than for other identical devices. The assumption is therefore made that such devices will show a below-average usage level compared to their counterparts, because obtaining a device from an area of low-consumption wards means additional (walking) effort for a consumer. It is assumed that such a device is the last resort if no identical devices in the closer vicinity of the requesting ward are free.
• the nature of the device (device type): What kind of device is it? Taking external statistical surveys in the health sector into account, it becomes clear that the usage of a device also depends on its type. Assumption: general devices (infusion syringe pumps and infusion pumps) are used more frequently than special devices (cardiac massage or ventilation devices). A device is uniquely identified by its WLAN chip ID. If it is known what kind of device it is (by storing the device information in an additional master database or by extending the localization database with a field Type for the device type), this influence factor can be determined as well.
This shows that consumer, storage location, and device mutually depend on each other and influence the utilization rate. It now has to be examined in more detail which constellations indicate usage and which are responsible for storage.
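As a purely illustrative sketch (not part of the described project; all names are hypothetical, and the utilization value can be obtained as in the sketch shown earlier), the three influence factors above can be combined into one feature record per device, which later serves as input for the descriptive and predictive models:

```python
def device_features(tag_id, utilization, last_station, storage_zone, device_type):
    """One record per device combining the influence factors of Section 5.1."""
    return {
        "TagId": tag_id,
        "utilization": utilization,      # key figure per device over the observation period
        "last_station": last_station,    # consumer of the device (ward)
        "storage_zone": storage_zone,    # assigned storage location (room reference)
        "device_type": device_type,      # e.g. "infusion_pump" vs. a special device
    }

# Example record for one infusion pump.
print(device_features("3003121670", 0.31, "ward-3", "hall-a", "infusion_pump"))
```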
5.2 Application of Data Mining Methods
With the statistical method, key figures such as the utilization rate (with respect to the device) and the usage rate (with respect to the ward) are computed from the values in the database via dedicated database queries. However, in order to also produce forecasts and trends that allow planning the deployment of the devices, suitable methods from the field of data mining have to be found.
On the one hand, clustering algorithms such as k-means are suitable for grouping devices with similar usage behavior or similar properties. The underlying idea is that similar device properties in the usage database also lead to similar usage statistics. From the specific properties of the resulting clusters, conclusions can then be drawn about the different conditions behind the differing usage levels of the devices. Because of the possibility of many noise points, which can arise, for example, when devices are moved, hierarchical algorithms or density-based algorithms such as DBSCAN, or its incremental version, may be better suited for this application. From the clustering results, the current usage state of the individual devices can be read off, and conclusions can be drawn about which properties influence the usage behavior.
Descriptive models are not suitable for making predictions about the future usage of the devices. Here, prediction models such as decision trees have to be used. For the classification models, the target variable is the previously determined usage of the individual devices. However, since this value can range from 0 up to the entire observation period, it is easier to split the value range into two (high, low) or even three (high, medium, low) intervals that indicate the degree of usage. Different decision trees, or for example the NaiveBayes algorithm, can then be tested in order to see which properties of the data influence one usage level or the other. Afterwards, the individual usage levels can be predicted on different test data in which the values of the individual properties are varied. From these predictions, a hospital can draw conclusions about which properties should be intensified or reduced so that the desired usage levels of the mobile devices are reached.
A further possibility for analyzing the data is trajectory analysis, if the movement data of the mobile devices is not interpreted as noise points but simply as the data of a trajectory. These trajectories can then be grouped by similarity, and peculiarities can be identified. The individual paths of the devices can be analyzed and, as one possible result, the storage locations can be optimized. This optimization aims at reducing the walking distances of the nursing staff, which can be achieved by relocating the storage place for the devices to a more central location derived from the trajectories.
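The following sketch (an assumption-laden illustration using scikit-learn, not the project's actual tooling) shows the two model families discussed above on the hypothetical feature records introduced earlier: grouping devices by usage behavior with k-means, and predicting a discretized usage level with a decision tree. The discretization thresholds and the toy records are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def one_hot(values):
    """Minimal one-hot encoding for a list of categorical values."""
    levels = sorted(set(values))
    return np.array([[1.0 if v == level else 0.0 for level in levels] for v in values])

def usage_matrix(records):
    """Numeric design matrix: utilization plus one-hot encoded influence factors."""
    parts = [np.array([[r["utilization"]] for r in records])]
    for key in ("last_station", "storage_zone", "device_type"):
        parts.append(one_hot([r[key] for r in records]))
    return np.hstack(parts)

def cluster_devices(records, k=2):
    """Descriptive model: group devices with similar usage behaviour (k-means)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(usage_matrix(records))

def usage_level(utilization):
    """Discretize the target into three intervals, as suggested in the text (thresholds assumed)."""
    return "high" if utilization > 0.6 else "medium" if utilization > 0.3 else "low"

def train_usage_classifier(records):
    """Predictive model: which properties lead to a high, medium, or low usage level?"""
    X = usage_matrix(records)
    y = [usage_level(r["utilization"]) for r in records]
    return DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Toy usage with three hand-made device records.
records = [
    {"utilization": 0.7, "last_station": "icu", "storage_zone": "hall-a", "device_type": "infusion_pump"},
    {"utilization": 0.2, "last_station": "ward-3", "storage_zone": "hall-b", "device_type": "ventilator"},
    {"utilization": 0.5, "last_station": "icu", "storage_zone": "hall-a", "device_type": "infusion_pump"},
]
print(cluster_devices(records))
print(train_usage_classifier(records).predict(usage_matrix(records)))
```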
6 Summary
We have shown that resource optimization in a hospital offers a great deal of potential for the use of data mining techniques. Based on the problem described, the retrieval of mobile devices, and the solution approaches presented, the use of the resources and the effort for the staff using them can be reduced. For the analysis of the utilization of the mobile devices, descriptive and predictive algorithms can be employed for different objectives. We will investigate the implementation of this idea and the actual effects of using data mining for resource optimization within the project. As a note, it should briefly be mentioned that the steps described here are currently still in the planning phase or at the very beginning of their implementation, so no experimental results are available yet. This paper is intended as a discussion paper.

References
[Alapont et al., 2004] J. Alapont, A. Bella-Sanjuán, C. Ferri, J. Hernández-Orallo, J. D. Llopis-Llopis, and M. J. Ramírez-Quintana. Specialised Tools for Automating Data Mining for Hospital Management. Technical report, Universitat Politècnica de València, Camí de Vera s/n, 46022 Valencia, Spain, 2004.
[Köchlin, 2004] K. Köchlin. Ist der Einsatz von Facility Management im Krankenhaus notwendig? Technical report, Technische FH Berlin, 2004.
[Mauro et al., 2010] C. Mauro, J. M. Leimeister, and H. Krcmar. Serviceorientierte Integration medizinischer Geräte - ganzheitliche IT-Unterstützung klinischer Prozesse. Informatik Spektrum, April 2010.
[Nävy, 2006] Nävy. Facility Management - Grundlagen, Computerunterstützung, Systemeinführung, Anwendungsbeispiele. Springer Verlag, 2006.
[Odin, 2010] S. Odin. GEFMA Arbeitskreis Facility Management im Krankenhaus - Benchmarking. Technical report, GEFMA e.V., 2010.
[Pocsay and Distler, 2009] A. Pocsay and O. Distler. Zukunftsorientierter Wandel im Krankenhausmanagement - Outsourcing, IT-Nutzenpotenziale, Kooperationsformen, Changemanagement, chapter Geschäftsprozessmanagement im Gesundheitswesen - Organisation und IT wachsen zusammen. Springer Verlag, April 2009.
[Salfeld et al., 2009] Salfeld, Hehner, and Wichels. Modernes Krankenhaus Management, Konzepte und Lösungen. Springer Verlag, 2009.

Probability Estimation and Aggregation for Rule Learning
Jan-Nikolas Sulzmann, Johannes Fürnkranz

Abstract
Rule learning is known for its descriptive and therefore comprehensible classification models, which also yield good class predictions. For different classification models, such as decision trees, a variety of techniques for obtaining good probability estimates have been proposed and evaluated. However, so far there has been no systematic empirical study of how these techniques can be adapted to probabilistic rules and how these methods affect the probability-based rankings. In this paper we apply several basic methods for the estimation of class membership probabilities to classification rules. We also study the effect of a shrinkage technique for merging the probability estimates of rules with those of their generalizations. Finally, we compare different ways of combining probability estimates from an ensemble of rules. Our results show that for probability estimation it is beneficial to exploit the fact that rules overlap (i.e., rule averaging is preferred over rule sorting), and that individual probabilities should be combined at the level of rules and not at the level of theories.

1 Introduction
The main focus of symbolic learning algorithms such as decision tree and rule learners is to produce a comprehensible explanation for a class variable.
Thus, they learn concepts in the form of crisp IF-THEN rules. On the other hand, many practical applications require a finer distinction be- tween examples than is provided by their predicted class labels. For example, one may want to be able to provide a confidence score that estimates the certainty of a predic- tion, to rank the predictions according to their probability of belonging to a given class, to make a cost-sensitive pre- diction, or to combine multiple predictions. All these problems can be solved straight-forwardly if we can predict a probability distribution over all classes in- stead of a single class value. A straight-forward approach to estimate probability distributions for classification rules is to compute the fractions of the covered examples for each class. However, this naı¨ve approach has obvious disadvan- tages, such as that rules that cover only a few examples may lead to extreme probability estimates. Thus, the probability estimates need to be smoothed. There has been quite some previous work on probabil- ity estimation from decision trees (so-called probability- estimation trees (PETS)). A very simple, but quite pow- erful technique for improving class probability estimates is the use of m-estimates, or their special case, the Laplace- estimates [Cestnik, 1990]. It has been shown that unpruned decision trees with Laplace-corrected probability estimates at the leaves produce quite reliable decision tree estimates [Provost and Domingos, 2003]. A recursive computation of the m-estimate, which uses the probability disctribution at level l as the prior probabilities for level l + 1, was pro- posed in [Ferri et al., 2003]. In [Wang and Zhang, 2006], a general shrinkage approach was used, which interpolates the estimated class distribution at the leaf nodes with the estimates in interior nodes on the path from the root to the leaf. An interesting observation is that, contrary to classifi- cation, class probability estimation for decision trees typi- cally works better on unpruned trees than on pruned trees. The explanation for this is simply that, as all examples in a leaf receive the same probability estimate, pruned trees provide a much coarser ranking than unpruned trees. In [Hu¨llermeier and Vanderlooy, 2009], a simple but elegant analysis of this phenomenon was provided, which shows that replacing a leaf with a subtree can only lead to an increase in the area under the ROC curve (AUC), a com- monly used measure for the ranking capabilities of an al- gorithm. Of course, this only holds for the AUC estimate on the training data, but it still may provide a strong indica- tion why unpruned PETs typically also outperform pruned PETs on the test set. Despite the amount of work on probability estimation for decision trees, there has been hardly any systematic work on probability estimation for rule learning. Despite their obvious similarility, we nevertheless argue that a separate study of probability estimates for rule learning is necessary. A key difference is that in the case of decision tree learn- ing, probability estimates will not change the prediction for an example, because the predicted class only depends on the probabilities of a single leaf of the tree, and such local probability estimates are typically monotone in the sense that they all maintain the majority class as the class with the maximum probability. In the case of rule learning, on the other hand, each example may be classified by multi- ple rules, which may possibly predict different classes. 
As many tie breaking strategies depend on the class probabili- ties, a local change in the class probability of a single rule may change the global prediction of the rule-based classi- fier, even if the order of all local estimates is maintained. Because of such non-local effects, it is not evident that the same methods that work well for decision tree learn- ing will also work well for rule learning. Indeed, as we will see in this paper, our conclusions differ from those that 143 have been drawn from similar experiments in decision tree learning. For example, the above-mentioned argument that unpruned trees will lead to a better (training-set) AUC than pruned trees, does not straight-forwardly carry over to rule learning, because the replacement of a leaf with a subtree is a local operation that only affects the examples that are covered by this leaf. In rule learning, on the other hand, each example may be covered by multiple rules, so that the effect of replacing one rule with multiple, more specific rules is less predictable. Moreover, each example will be covered by some leaf in a decision tree, whereas each rule learner needs to induce a separate default rule that covers examples that are covered by no other rule. The rest of the paper is organized as follows: In Sec- tion 2 we briefly describe the basics of probabilistic rule learning and discuss the used estimation techniques used for rule probabilities. In Section 3 we describe the rule learning algorithm that we used in our experiments, con- trast two approaches for the generation of a probabilistic rule set, and describe how they are used for classification. Experimental results, which compare the different proba- bility estimation techniques in these two scenarios, are de- scribed in Section 4. We then discuss four techniques for obtaining rule probabilities from a bagging ensemble (Sec- tion 5), and compare them experimentally in Section 6. In the end, we summarize our conclusions in Section 7.1 2 Rule Learning and Probability Estimation This section is divided into two parts. The first one de- scribes briefly the properties of conjunctive classification rules and of its extension to a probabilistic rule. In the sec- ond part we introduce the probability estimation techniques used in this paper. These techniques can be divided into ba- sic methods, which can be used stand-alone for probability estimation, and the meta technique shrinkage, which can be combined with any of the techniques for probability es- timation. 2.1 Probabilistic Rule Learning In classification rule mining one searches for a set of rules that describes the data as accurately as possible. As there are many different generation approaches and types of gen- erated classification rules, we do not go into detail and re- strict ourselves to conjunctive rules. The premise of these rules consists of a conjunction of number of conditions, and in our case, the conclusion of the rule is a single class value. So a conjunctive classification rule r has basically the fol- lowing form: condition1 ∧ · · · ∧ condition|r| =⇒ class (1) The size of a rule |r| is the number of its conditions. Each of these conditions consists of an attribute, an attribute value belonging to its domain and a comparison determined by the attribute type. For our purpose, we consider only nominal and numerical attributes. For nominal attributes, this comparison is a test of equality, whereas in the case of numerical attributes, the test is either less (or equal) or greater (or equal). 
If all conditions are met by an instance, the instance is covered by the rule (r ⊇ x) and the class value of the rule is predicted for the instance. Consequently, the rule is called a covering rule for this instance.¹

¹ Parts of this paper have previously appeared as [Sulzmann and Fürnkranz, 2009].

With this in mind, we can define some statistical values of a data set which are needed for later definitions. A data set consists of |C| classes and n instances, of which n^c belong to class c, respectively (n = \sum_{c=1}^{|C|} n^c). A rule r covers n_r instances which are distributed over the classes, so that n_r^c instances belong to class c (n_r = \sum_{c=1}^{|C|} n_r^c).

A probabilistic rule r is an extension of a classification rule which does not only predict a single class value, but a set of class probabilities which form a probability distribution over the classes. This probability distribution estimates, for a covered instance, the probability of belonging to each of the classes in the data set, so we get one class probability per class. The example is then classified with the most probable class. The probability that an instance x covered by rule r belongs to c can be viewed as a conditional probability Pr(c | r ⊇ x). Thus the set of class probabilities can be written as a vector of probabilities sorted by the class ordering:

\vec{\Pr}(r \supseteq x) = \big(\Pr(c_1 \mid r \supseteq x), \ldots, \Pr(c_{|C|} \mid r \supseteq x)\big)   (2)

On the vector \vec{\Pr}(r \supseteq x), abbreviated \vec{\Pr}_r(x), we define the following maximum function:

\max\big(\vec{\Pr}_r(x)\big) = \max_{c \in C} \Pr(c \mid r \supseteq x)   (3)

On sets of class probability vectors \bigcup_{j=1}^{k} \vec{\Pr}_j(x) we define the average function

\mathrm{avg}\Big(\bigcup_{j=1}^{k} \vec{\Pr}_j(x)\Big) = \frac{1}{k} \sum_{j=1}^{k} \vec{\Pr}_j(x)   (4)

and the multiplication

\mathrm{mult}\Big(\bigcup_{j=1}^{k} \vec{\Pr}_j(x)\Big) = \Big(\prod_{j=1}^{k} \Pr(c_1 \mid r_j \supseteq x), \ldots, \prod_{j=1}^{k} \Pr(c_{|C|} \mid r_j \supseteq x)\Big)   (5)

Obviously, the results of these functions are also class probability vectors. In the next section, we discuss some approaches for estimating these class probabilities.

2.2 Basic Probability Estimation
In this subsection we review three basic methods for probability estimation. Subsequently, in Section 2.3, we describe a technique known as shrinkage, which is known from various application areas, and show how it can be adapted to probabilistic rule learning. All three basic methods we employed calculate the relation between the number of instances covered by the rule, n_r, and the number of instances that are covered by the rule and also belong to a specific class, n_r^c. The methods differ only in minor modifications of the calculation of this relation.

The simplest approach to rule probability estimation directly estimates the class probability distribution of a rule with the fraction of covered examples that belong to each class:

\Pr_{\mathrm{naive}}(c \mid r \supseteq x) = \frac{n_r^c}{n_r}   (6)

This naïve approach has several well-known disadvantages, most notably that rules with a low coverage may lead to extreme probability values. For this reason, the use of the Laplace- and m-estimates was suggested in [Cestnik, 1990]. The Laplace estimate modifies the above-mentioned relation by adding one additional instance to the count n_r^c of each class c. Hence the number of covered instances n_r is increased by the number of classes |C|:

\Pr_{\mathrm{Laplace}}(c \mid r \supseteq x) = \frac{n_r^c + 1}{n_r + |C|}   (7)

It may be viewed as a trade-off between \Pr_{\mathrm{naive}}(c \mid r \supseteq x) and an a priori probability of Pr(c) = 1/|C| for each class. Thus, it implicitly assumes a uniform class distribution.
The m-estimate generalizes this idea by making the dependency on the prior class distribution explicit, and by introducing a parameter m which allows trading off the influence of the a priori probability against \Pr_{\mathrm{naive}}:

\Pr_{m}(c \mid r \supseteq x) = \frac{n_r^c + m \cdot \Pr(c)}{n_r + m}   (8)

The m-parameter may be interpreted as a number of examples that are distributed according to the prior probability and added to the class frequencies n_r^c. The prior probability is typically estimated from the data using Pr(c) = n^c / n (but one could, e.g., also use the above-mentioned Laplace correction if the class distribution is very skewed). Obviously, the Laplace estimate is a special case of the m-estimate with m = |C| and Pr(c) = 1/|C|.

2.3 Shrinkage
Shrinkage is a general framework for smoothing probabilities which has been successfully applied in various research areas.² Its key idea is to "shrink" probability estimates towards the estimates of the rule's generalizations r^k, which cover more examples. This is quite similar to the idea of the Laplace- and m-estimates, with two main differences: First, the shrinkage happens not only with respect to the prior probability (which would correspond to a rule covering all examples) but interpolates between several different generalizations, and second, the weights for the trade-off are not specified a priori (as with the m-parameter in the m-estimate) but are estimated from the data. In general, shrinkage estimates the probability Pr(c | r ⊇ x) as follows:

\Pr_{\mathrm{Shrink}}(c \mid r \supseteq x) = \sum_{k=0}^{|r|} w_c^k \Pr(c \mid r^k)   (9)

where the w_c^k are weights that interpolate between the probability estimates of the generalized rules r^k. In our implementation, we use only generalizations of a rule that can be obtained by deleting a final sequence of conditions. Thus, for a rule of length |r|, we obtain |r| + 1 generalizations r^k, where r^0 is the rule covering all examples and r^{|r|} = r.

The weights w_c^k can be estimated in various ways. We employ a shrinkage method proposed in [Wang and Zhang, 2006] which is intended for decision tree learning but can be straight-forwardly adapted to rule learning. The authors propose to estimate the weights w_c^k with an iterative procedure which averages the probabilities obtained by removing training examples covered by the rule. In effect, we obtain two probabilities per rule generalization and class: the removal of an example of class c leads to a decreased probability Pr^-(c | r^k ⊇ x), whereas the removal of an example of a different class results in an increased probability Pr^+(c | r^k ⊇ x). Weighting these probabilities with the relative occurrence of training examples belonging to this class, we obtain a smoothed probability

\Pr_{\mathrm{Smoothed}}(c \mid r^k \supseteq x) = \frac{n_r^c}{n_r} \cdot \Pr^{-}(c \mid r^k \supseteq x) + \frac{n_r - n_r^c}{n_r} \cdot \Pr^{+}(c \mid r^k \supseteq x)   (10–11)

Using these smoothed probabilities, this shrinkage method computes the weights in linear time (linear in the number of covered instances) by normalizing the smoothed probabilities separately for each class:

w_c^k = \frac{\Pr_{\mathrm{Smoothed}}(c \mid r^k \supseteq x)}{\sum_{i=0}^{|r|} \Pr_{\mathrm{Smoothed}}(c \mid r^i \supseteq x)}   (12)

Multiplying the weights with their corresponding probabilities, we obtain "shrinked" class probabilities for the instance. Note that all instances which are classified by the same rule receive the same probability distribution. Therefore, the probability distribution of each rule can be calculated in advance.

² Shrinkage is, e.g., regularly used in statistical language processing [Chen and Goodman, 1998; Manning and Schütze, 1999].
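As a compact illustration of Equations (6)–(8), the three basic estimators can be computed from the class counts of a rule as shown below. This is a sketch using the notation above, not the authors' implementation; the data structures (dicts mapping classes to counts or priors) and the toy numbers are assumptions.

```python
def naive_estimate(n_rc, n_r):
    """Equation (6): fraction of covered examples per class; n_rc maps class -> count."""
    return {c: (cnt / n_r if n_r else 0.0) for c, cnt in n_rc.items()}

def laplace_estimate(n_rc, n_r):
    """Equation (7): add one virtual example per class."""
    num_classes = len(n_rc)
    return {c: (cnt + 1) / (n_r + num_classes) for c, cnt in n_rc.items()}

def m_estimate(n_rc, n_r, prior, m=2.0):
    """Equation (8): trade off the observed fraction against the prior with m virtual examples."""
    return {c: (cnt + m * prior[c]) / (n_r + m) for c, cnt in n_rc.items()}

# Toy usage: a rule covering 5 examples, 4 of class "a" and 1 of class "b".
n_rc = {"a": 4, "b": 1}
prior = {"a": 0.3, "b": 0.7}
print(naive_estimate(n_rc, 5))
print(laplace_estimate(n_rc, 5))
print(m_estimate(n_rc, 5, prior, m=5.0))
```

The shrinkage estimate of Equations (9)–(12) would additionally interpolate such per-rule distributions with the distributions of the rule's generalizations; it is omitted here because it requires the removal-based probabilities described in the text.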
3 Rule Learning Algorithm For the rule generation we employed the rule learner Rip- per [Cohen, 1995], arguably one of the most accurate rule learning algorithms today. We used Ripper both in ordered and in unordered mode: Ordered Mode: In ordered mode, Ripper learns rules for each class, where the classes are ordered according to ascending class frequencies. For learning the rules of class ci, examples of all classes cj with j > i are used as negative examples. No rules are learned for the last and most frequent class, but a rule that implies this class is added as the default rule. At classification time, these rules are meant to be used as a decision list, i.e., the first rule that fires is used for prediction. Unordered Mode: In unordered mode, Ripper uses a one- against-all strategy for learning a rule set, i.e., one set of rules is learned for each class ci, using all examples of classes cj , j 6= i as negative examples. At predic- tion time, all rules that cover an example are consid- ered and the rule with the maximum probability esti- mate is used for classifying the example. If no rule covers the example, it classified by the default rule predicting the majority class. We used JRip, the Weka [Witten and Frank, 2005] imple- mentation of Ripper. Contrary to William Cohen’s original implementation, this re-implementation does not support the unordered mode, so we had to add a re-implementation of that mode.3 We also added a few other minor modifi- cations which were needed for the probability estimation, e.g. the collection of statistical counts of the sub rules. In addition, Ripper (and JRip) can turn the incremen- tal reduced error pruning technique [Fu¨rnkranz and Wid- mer, 1994; Fu¨rnkranz, 1997] on and off. Note, however, 3Weka supports a general one-against-all procedure that can also be combined with JRip, but we could not use this because it did not allow us to directly access the rule probabilities. 145 that with turned off pruning, Ripper still performs pre- pruning using a minimum description length heuristic [Co- hen, 1995]. We use Ripper with and without pruning and in ordered and unordered mode to generate four set of rules. For each rule set, we employ several different class proba- bility estimation techniques. In the test phase, all covering rules are selected for a given test instance. Using this reduced rule set we deter- mine the most probable rule. For this purpose we select the most probable class of each rule and use this class value as the prediction for the given test instance and the class prob- ability for comparison. Ties are solved by predicting the least represented class. If no covering rules exist the class probability distribution of the default rule is used. 4 Experimental Results 4.1 Experimental Setup We performed our experiments within the WEKA frame- work [Witten and Frank, 2005]. We tried each of the four configuration of Ripper (unordered/ordered and pruning/no pruning) with 5 different probability estimation techniques, Naı¨ve (labeled as Precision), Laplace, and m-estimate with m ∈ {2, 5, 10}, both used as a stand-alone probability es- timate (abbreviated with B) or in combination with shrink- age (abbreviated with S). As a baseline, we also included the performance of pruned or unpruned standard JRip ac- cordingly. Our unordered implementation of JRip using Laplace stand-alone for the probability estimation is com- parable to the unordered version of Ripper (Cohen, 1995), which is not implemented in JRip. 
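The prediction step just described can be sketched as follows. This is our reading of the procedure with hypothetical data structures (each rule carries a coverage test and a dict of class probabilities); it is not the modified JRip code.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    covers: Callable[[dict], bool]   # premise: does the rule cover the instance?
    dist: Dict[str, float]           # estimated class probability distribution

def classify(instance, rules, default_dist, class_counts):
    """Among all covering rules, predict the most probable class of the most confident rule;
    break ties by the least represented class; fall back to the default rule otherwise."""
    covering = [r for r in rules if r.covers(instance)]
    dists = [r.dist for r in covering] if covering else [default_dist]
    # best class and its probability for every candidate distribution
    best = [(max(d.values()), max(d, key=d.get)) for d in dists]
    top_prob = max(p for p, _ in best)
    tied_classes = {c for p, c in best if p == top_prob}
    return min(tied_classes, key=lambda c: class_counts[c])

# Toy usage with two rules and a two-class problem.
r1 = Rule(covers=lambda x: x["a"] > 0, dist={"pos": 0.8, "neg": 0.2})
r2 = Rule(covers=lambda x: x["b"] > 0, dist={"pos": 0.3, "neg": 0.7})
print(classify({"a": 1, "b": 1}, [r1, r2], {"pos": 0.4, "neg": 0.6}, {"pos": 60, "neg": 40}))
```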
We evaluated these methods on 33 data sets of the UCI repository [Asuncion and Newman, 2007] which differ in the number of attributes (and their categories), classes and training instances. As a performance measure, we used the weighted area under the ROC curve (AUC), as used for probabilistic decision trees in [Provost and Domingos, 2003]. Its key idea is to extend the binary AUC to the multi- class case by computing a weighted average the AUCs of the one-against-all problems Nc, where each class c is paired with all other classes: AUC(N) = ∑ c∈C nc |N | AUC(Nc) (13) For the evaluation of the results we used the Friedman test with a post-hoc Nemenyi test as proposed in [Demsar, 2006]. The significance level was set to 5% for both tests. We only discuss summarized results here, detailed result tables can be found in [Sulzmann and Fu¨rnkranz, 2009] and [Sulzmann and Fu¨rnkranz, 2010]. 4.2 Ordered Rule Sets In the first two test series, we investigated the ordered ap- proach using the standard JRip approach for the rule gener- ation, both with and without pruning. The basic probability methods were used standalone (B) or in combination with shrinkage (S). The Friedman test showed that in both test series, the em- ployed combinations of probability estimation techniques showed significant differences. Considering the CD chart of the first test series (Figure 1(a)), one can identify three groups of equivalent techniques. Notable is that the two best techniques, the m-Estimate used stand-alone with m = 2 and m = 5 respectively, belong only to the best group. These two are the only methods that are signifi- cantly better than the two worst methods, Precision used stand-alone and Laplace combined with shrinkage. On the other hand, the naı¨ve approach seems to be a bad choice as both techniques employing it rank in the lower half. However our benchmark JRip is positioned in the lower third, which means that the probability estimation tech- niques clearly improve over the default decision list ap- proach implemented in JRip. Comparing the stand-alone techniques with those em- ploying shrinkage one can see that shrinkage is outper- formed by their stand-alone counterparts. Only Precision is an exception as shrinkage yields increased performance in this case. In the end shrinkage is not a good choice for this scenario. The CD-chart for ordered rule sets with pruning (Figure 1(b)) features four groups of equivalent techniques. No- table are the best and the worst group which overlap only in two techniques, Laplace and Precision used stand-alone. The first group consists of all stand-alone methods and JRip which dominates the group strongly covering no shrinkage method. The last group consists of all shrinkage methods and the overlapping methods Laplace and Precision used stand-alone. As all stand-alone methods rank before the shrinkage methods, one can conclude that they outperform the shrinkage methods in this scenario as well. Ripper per- forms best in this scenario, but the difference to the stand- alone methods is not significant. 4.3 Unordered Rule Sets Test series three and four used the unordered approach em- ploying the modified JRip which generates rules for each class. Analogous to the previous test series the basic meth- ods are used as stand-alone methods or in combination with shrinkage (left and right column respectively). Test series three used no pruning while test series four did so. 
The re- sults of the Friedman test showed that the techniques of test series three and test series four differ significantly. Regarding the CD chart of test series three (Figure 1(c)), we can identify four groups of equivalent methods. The first group consists of all stand-alone techniques, except for Precision, and the m-estimates techniques combined with shrinkage and m = 5 and m = 10, respectively. Whereas the stand-alone methods dominate this group, m = 2 being the best representative. Apparently these methods are the best choices for this scenario. The second and third con- sist mostly of techniques employing shrinkage and overlap with the worst group in only one technique. However our benchmark JRip belongs to the worst group being the worst choice of this scenario. Additionally the shrinkage methods are outperformed by their stand-alone counterparts. The CD chart of test series four (Figure 1(d)) shows sim- ilar results. Again four groups of equivalent techniques groups can be identified. The first group consists of all stand-alone methods and the m-estimates using shrinkage and m = 5 and m = 10 respectively. This group is dom- inated by the m-estimates used stand-alone with m = 2, m = 5 or m = 10. The shrinkage methods are distributed over the other groups, again occupying the lower half of the ranking. Our benchmark JRip is the worst method of this scenario. 4.4 Unpruned vs. Pruned Rule Sets Rule pruning had mixed results, which are briefly summa- rized in Table 1. On the one hand, it improved the results of the ordered approach, on the other hand it worsened the results of the unordered approach. In any case, in our ex- 146 (a) Ordered rule sets without pruning (b) ordered rule sets with pruning (c) Unordered rule sets without pruning (d) Unordered rule sets with pruning Figure 1: Critical distance charts of ordered rule sets ((a) and (b)) and unordered rule sets ((c) and (d)) periments, contrary to previous results on PETs, rule prun- ing was not always a bad choice. The explanation for this result is that in rule learning, contrary to decision tree learn- ing, new examples are not necessarily covered by one of the learned rules. The more specific rules become, the higher is the chance that new examples are not covered by any of the rules and have to be classified with a default rule. As these examples will all get the same default probability, this is a bad strategy for probability estimation. Note, however, that JRip without pruning, as used in our experiments, still performs an MDL-based form of pre-pruning. We have not yet tested a rule learner that performs no pruning at all, but, because of the above deliberations, we do not expect that this would change the results with respect to pruning. 5 Probability Estimates from Rule Ensembles Instead of trying to improve the probability estimates for each individual leaf, one can also resort to averaging mul- tiple estimates, thereby reducing the variance of the re- sulting probability estimates. For example, a technique based on bagging multiple unpruned decision trees was used in [Domingos, 1999] to obtain improved probability estimates, which were subsequently used for cost-sensitive classification. An adaptation of this technique to rule learn- ing again has the effect that an example may be covered by a varying number of rules, whereas in decision tree learn- ing, each example will be covered by exactly s leafs (where s is the number of trees learned). 
For investigating the performance ensemble-based prob- ability estimates, we combined our rule learner with bag- ging [Breiman, 1996]. We generated s bootstrap samples S1, · · · , Ss by repeately drawing n instances with replace- ment. The rule algorithm described in Section 3 was ap- plied to each sampled data set si obtaining s classifiers C1, · · · , Cs and their corresponding rule sets R1, · · · , Rs. The probability estimation for each is computed using the previously introduced basic estimation methods. For prediction, we determine the covering rules of each sampled rule set Ri for a given example x Ri(x) = {r ∈ Ri|r ⊇ x} (14) Let Covi(x) denote the set of all class probability distribu- tions that originate from a rule in the set of covering rules Covi(x) = { ~Prr(x)|r ∈ Ri(x) } . (15) From this set of class probability distributions of the cov- ering rules, we try to estimate a class probability distribu- tion for the given example x. For this purpose we have to decode the probability estimations of these covering rules ~Prr(x) into a single normalized global class probability distribution ~Prglobal(x). For our approach we considered four decoding methods. The first three methods have in common that they average the class probability distribution of (some of) the covering rules. Best Rule: Only the most confident covering rule ~Pri(x) of each covering rule set Ri(x) is determined. ~Pri(x) = argmax ~Pr(x)∈Covi ( ~Pr(x) ) (16) Afterwards the class probability distributions of these rules are averaged and the result is normalized: ~Prglobal(x) = avg ( ~Pr1, · · · , ~Prs ) ∥∥∥avg ( ~Pr1, · · · , ~Prs)∥∥∥ 1 . (17) Macro Averaging: All covering rules are determined and their class probability distributions are macro- averaged in two steps. First the class probability dis- tributions Covi(x) of each covering rule set Ri(x) are averaged and normalized: ~Pri = avg (Covi(x)) ‖avg (Covi(x))‖1 . (18) These local class probabilities are then averaged as above (17). Micro Averaging: All covering rules are determined and their class probability distributions are micro- averaged. Essentially, this means that all learned rules are pooled, and the average is formed over the result- ing set set of rules: ~Prglobal(x) = avg (Cov1(x) ∪ · · · ∪ Covs(x)) ‖avg (Cov1(x) ∪ · · · ∪ Covs(x))‖1 (19) Bayesian Decoding: All covering rules are pooled as above, but their class probability distributions are mul- tiplied with each other and with the vector of the a priori class probabilities ~Prprior = ( Pr(c1), · · · ,Pr(c|C| ) . 147 Table 1: Unpruned vs. pruned rule sets: Win/Loss for ordered (top) and unordered (bottom) rule sets JRip Precision Laplace M 2 M 5 M 10 Win 26 23 19 20 19 18 20 19 20 19 20 Loss 7 10 14 13 14 15 13 14 13 14 13 Win 26 21 9 8 8 8 8 8 8 8 6 Loss 7 12 24 25 25 25 25 25 25 25 27 Thus ~Prglobal(x) is calculated as follows ~Prglobal(x) = mult (⋃s i=1 Covi(x) ∪ { ~Prprior }) ∥∥∥mult(⋃si=1 Covi(x) ∪ { ~Prprior})∥∥∥ 1 (20) In all cases, the resulting class probability distribution ~Prglobal(x) is used for the prediction. For this purpose the most probable class according to ~Prglobal(x) is selected and this class value is used as the prediction for the given test instance x. Ties are solved by predicting the least rep- resented class. If no covering rules (Covi(x) = ∅) exist for a sampled rule set Ri the class probability distribution of the default rule is used accordingly. 
6 Results on Ensemble-Based Probability Estimates 6.1 Experimental Setup The above methods were integrated into the framework de- scribed in Section 4.1. For sampling, we employed the un- supervised random sampling of WEKA for the generation of the bootstrap samples and applied the Bagging imple- mentation of WEKA to JRip. In accordance with the re- sults obtained above, we only used unordered rule sets for these experiments because these produce better probability estimates than ordered rule sets. Furthermore we know that the unpruned rule sets perform better for the Laplace- and m-estimates than pruned rule sets if the rule sets are gen- erated by the unordered JRip. For Precision the opposite is true. So we employed these basic probability estima- tion techniques on either unpruned or pruned rule sets ac- cording to these observations. As the employed shrinkage method worsened the probability estimation we abstained from using shrinkage in these experiments. All previously mentioned decoding methods - Best rule, Macro and Mi- cro Averaging, and Bayesian Decoding - were employed. So we computed all combinations of the basic probability estimation and decoding methods on a different number of bootstrap samples - 10, 20, 50 and 100 samples accord- ingly. These methods were compared to the default config- uration of JRip (pruned, ordered rule sets) and to its bagged version (with or without pruning) using the same bootstrap samples. All methods were evaluated on the 33 previously used data sets of the UCI repository. As a performance mea- sure, we used the weighted area under the ROC curve (AUC) also. For our comparison we calculated the average weighted AUC over all data sets for all combinations (com- bining a probability estimation technique and a decoding method to a given number of boostrap samples), the sum- marized results are depicted in Figure 2. Detailed results can be found in [Sulzmann and Fu¨rnkranz, 2010]. 6.2 Comparison to JRip In our experiments we compared JRip and its bagged ver- sions to the basic probability estimation techniques em- ploying the four decoding methods. Figure 2 shows the results of unpruned bagged JRip and the basic probability estimation techniques. We omitted to depict the results of JRip and the pruned bagged JRip because they both had a worse performance than the unpruned bagged version which has the worst performance of the depicted methods. The weighted AUC of their best representative, unpruned bagged JRip, was always at least two absolute percentage points lower than the weighted AUC of the basic probabil- ity estimation techniques for all decoding methods and all numbers of samples. So we can conclude that the perfor- mance of the basic probability estimation techniques im- prove over the default probability estimation integrated in JRip. As this observation was also made in the basic rule learning experiments, we see our approach reconfirmed. 6.3 Comparison of the base probability estimation techniques Each of the results of Figure 2 is the average perfor- mance over several different base probability estimation techniques: Precision applied to pruned rule sets and the Laplace and the m-estimate (m ∈ {2, 5, 10}) applied to unpruned rule sets. The applied decoding methods seem to have only a small impact on the ranking (according the weighted AUC) of the basic probability estimation tech- niques. 
Especially for a higher number of bootstrap sam- ples, 50 and 100, these rankings are always the same - Precision having the best performance followed by the m- estimate using m = 2, m = 5, or m = 10 and the Laplace estimate in this order. For the smaller numbers of sam- ples, 10 and 20, the ranking is a little bit more dynamic. Although the best two methods, m-estimate with m = 2 and m = 5 (in this order), and the worst method, Laplace estimate, are always the same, the methods in the center, Precision and the m-estimate using m = 10, switch places dependent on the decoding methods. So we can conclude that the m-estimate using m = 2 is the best choice for a low number of samples just as Precision is the best choice for a higher number of samples. For all probability esti- mation techniques holds that an increase in the number of samples leads to an improvement in the weighted AUC but the gain is a bit lower than the gain of the bagged JRip. 6.4 Comparison of the decoding methods In this section, we want to compare the four employed de- coding methods: Best Rule, Macro and Micro Averaging, and Bayesian Decoding. For this purpose we calculated the average weighted AUC over all data sets for all com- binations obtained by combining a probability estimation technique, a decoding method and a number of bootstrap samples. Afterwards, we determined on this data the aver- age rank of each decoding method (Table 2) and used this information for a Friedman test with a significance of 5%. According to this test, the decoding methods differ signif- icantly, so we applied a post-hoc Nemenyi test which is depicted in a CD chart (Figure 3). 148 Table 2: Decoding Methods: Count of each rank and the average rank Ranked Average Decoding Method 1st 2nd 3rd 4th Rank Best Rule 0 0 4 16 3.80 Macro Averaging 0 17 3 0 2.15 Micro Averaging 20 0 0 0 1.00 Bayesian Decoding 0 3 13 4 3.05 Obviously, Micro Averaging is the best decoding meth- ods in our experiments as it always placed first according to the average weighted AUC. This observation is recon- firmed by the CD chart since the Micro Averaging is the only member of the best group of methods. Macro Av- eraging and Bayesian Decoding do not differ significantly being both in the second best group of decoding methods. Nevertheless Bayesian Decoding is also in the worst group together with Best Rule which is the worst choice in our experiments. The observed ranking of the decoding methods can be at- tributed to the individual exploitation of the covering rules. As the decoding method Best Rule only uses the best rule of each bootstrap sample, a great deal of evidence has no influence on the probability estimation. Thus, it is not sur- prising that Best Rule ranks behind the methods that make better use of the ensemble. The methods Macro and Micro Averaging both average the probability distributions of a number of covering rules but their averaging approaches differ. So the influence of a high evidence for a class in a sampled rule set is also different for the two methods. For Micro Averaging the aforementioned evidence has a direct effect on resulting probability distribution as all covering rules have the same weight in its calculation. As our rule sets only have a low redundancy, this effect is desirable. For Macro Averaging a high evidence in a sampled rule set influences only the probability distribution of this bootstrap sample. So the number of covering rules has no effect on the global prob- ability distribution. 
As Macro Averaging partially discards the available information Micro Averaging should perform better than Macro Averaging as observed in our experi- ments. Bayesian Decoding uses all the information contained in the covering rules as Micro Averaging does. These two methods differ only the way how they combine the infor- mation of the covering rule, averaging or multiplying their probability distributions. The multiplication used in the Bayesian approach has a tendency to prefer a number of medium probabilities to a balanced number of low and high probabilities. This bias has a negative effect on the calcu- lation of the global probability distribution. Averaging is more desirable as high probabilities have a greater impact on its calculation. Figure 3: CD chart of the decoding methods 7 Conclusions The most important result of our study is that probability estimation is clearly an important part of a good rule learn- ing algorithm. The probabilities of rules induced by JRip can be improved considerably by simple estimation tech- niques. In unordered mode, where one rule is generated for each class, JRip is outperformed in every scenario. On the other hand, in the ordered setting, which essentially learns decision lists by learning subsequent rules in the context of previous rules, the results were less convincing, giving a clear indication that the unordered rule induction mode should be preferred when a probabilistic classfication is de- sirable. Among the tested probability estimation techniques, the m-estimate typically outperformed the other methods in our version of the JRip algorithm. The superiority of the m-estimate was not sensitive to the choice of its parame- ter. When combined with an ensemble-based approach, the m-estimate maintained its superiority for smaller number of bootstrap samples, but typically lower value of m per- formed better. For higher numbers of bootstrap samples, precision, which corresponds to a value of m = 0, outper- formed the other methods. Thus, it seems to be the case that the use of the m-estimate primarily helps to reduce the variance of the probability estimates. The employed shrinkage method did, in general, not im- prove the simple estimation techniques. It remains to be seen whether alternative ways of setting the weights could yield superior results. Rule pruning did not produce the bad results that are known from ranking with pruned decision trees, presumably because unpruned, overly specific rules will increase the number of uncovered examples, which in turn leads to bad ranking of these examples. Ensemble-based probability estimation based on a bag- ging approach further improved the probability estimates. The improvement increases with the number of bootstrap samples. In every case, the probabilistic approach outper- formed Bagged JRip using the same number of bootstrap samples. Amongst the employed decoding methods, Micro Averaging of the probability distributions of all covering rules was without exceptions superior to the other meth- ods, indicating that the predictions of rule-based ensembles should be combined at the level of individual rules and not at the level of theories. Acknowledgements This research was supported by the German Science Foundation (DFG) under grant FU 580/2. References A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123– 140, 1996. Bojan Cestnik. Estimating probabilities: A crucial task in Ma- chine Learning. In L. 
Aiello, editor, Proceedings of the 9th European Conference on Artificial Intelligence (ECAI-90), pages 147–150, Stockholm, Sweden, 1990. Pitman.

Stanley F. Chen and Joshua T. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, 1998.

[Figure 2: Average Weighted AUC of the employed decoding methods on unpruned rule sets (where stated the pruned rule set is used); panels: (a) Best Rule, (b) Macro Averaging, (c) Micro Averaging, (d) Bayesian Decoding]

William W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference on Machine Learning (ML-95), pages 115–123, Lake Tahoe, CA, 1995. Morgan Kaufmann.

Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Pedro Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), pages 155–164, San Diego, CA, 1999. ACM.

César Ferri, Peter A. Flach, and José Hernández-Orallo. Improving the AUC of probabilistic estimation trees. In Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, and Hendrik Blockeel, editors, Proceedings of the 14th European Conference on Machine Learning, pages 121–132, Cavtat-Dubrovnik, Croatia, 2003. Springer.

Johannes Fürnkranz and Gerhard Widmer. Incremental Reduced Error Pruning. In William W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning (ML-94), pages 70–77, New Brunswick, NJ, 1994. Morgan Kaufmann.

Johannes Fürnkranz. Pruning algorithms for rule learning. Machine Learning, 27(2):139–171, 1997.

Eyke Hüllermeier and Stijn Vanderlooy. Why fuzzy decision trees are good rankers. IEEE Transactions on Fuzzy Systems, 17(6):1233–1244, 2009.

Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.

Foster J. Provost and Pedro Domingos. Tree induction for probability-based ranking. Machine Learning, 52(3):199–215, 2003.

Jan-Nikolas Sulzmann and Johannes Fürnkranz. An empirical comparison of probability estimation techniques for probabilistic rules. In João Gama, Vítor Santos Costa, A. Jorge, and Pavel B. Brazdil, editors, Proceedings of the 12th International Conference on Discovery Science (DS-09), pages 317–331. Springer-Verlag, 2009. Winner of Best Student Paper Award.

Jan-Nikolas Sulzmann and Johannes Fürnkranz. Probability estimation and aggregation for rule learning. Technical Report TUD-KE-2010-03, TU Darmstadt, Knowledge Engineering Group, 2010.

Bin Wang and Harry Zhang. Improving the ranking performance of decision trees. In Johannes Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Proceedings of the 17th European Conference on Machine Learning (ECML-06), pages 461–472, Berlin, Germany, 2006. Springer-Verlag.

Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, 2nd edition, 2005.
150 Conditional Random Fields For Local Adaptive Reference Extraction Martin Toepfer and Peter Kluegl and Andreas Hotho and Frank Puppe University of Würzburg, Department of Computer Science VI Am Hubland, 97074 Würzburg, Germany {toepfer, pkluegl, hotho, puppe}@informatik.uni-wuerzburg.de Abstract The accurate extraction of bibliographic infor- mation from scientific publications is an active field of research. Machine learning, especially sequence labeling approaches like Conditional Random Fields (CRF), are often applied for this reference extraction task, but still suffer from the ambiguity of reference notation. Reference sec- tions apply a predefined style guide and contain only homogeneous references. Therefore, other references of the same paper or journal often can provide evidence how the fields of a reference are correctly labeled. We propose a novel ap- proach that exploits the similarities within a doc- ument. Our process model uses information of unlabeled documents directly during the extrac- tion task in order to automatically adapt to the perceived style guide. This is implemented by changing the manifestation of the features for the applied CRF. The experimental results show con- siderable improvements compared to the com- mon approach. We achieve an average F1 score of 96.7% and an instance accuracy of 85.4% on the test data set. 1 Introduction Reference sections of research papers are a valuable source for many interesting applications. A considerable amount of research has been spent on creating and analyzing cita- tion graphs, yielding information about research commu- nities and topics. Social bookmarking services like Bib- sonomy1 on the other hand have become essential tools for researchers and facilitate the management of bibliographic data. Both applications, citation analysis and bookmarking services, rely on a structured representation of the reference data. Often the well-known BibTex format is used to de- fine the different fields of an information. The acquisition of this structured data demands for an automatic process- ing of the vast amount of the unstructured data available in publications. The knowledge for an automatic extraction of refer- ences can be formalized using rules or templates. How- ever, the handcrafting of rules is tedious and prone to er- ror due to the knowledge engineering bottleneck. Several publications have shown that machine learning and espe- cially sequence labeling approaches are more suitable for the reference extraction task [Peng and McCallum, 2004; 1http://www.bibsonomy.org/ Councill et al., 2008]. These methods learn a statistical model using training sets where the interesting information, namely the BibTeX fields, is already labeled. The model is applied on newly and unseen documents in order to iden- tify the information in unlabeled data. Hence, the model is only adapted offline on the previously seen documents of the training phase. Although these approaches achieve remarkable results, the heterogeneous styles of the refer- ences make a suitable generalization difficult and decrease the accuracy of the extraction task. The IEEE style, for ex- ample, separates the author and the title with a comma and surrounds the title with quotes. Whereas the ACM style applies no separator for the author and the date is located between the author and the title. The MISQ style surrounds the title also with quotes, but uses no separator for the au- thor. 
Nevertheless, the input data of the extraction task, i.e., the reference section of scientific publications, follows a single style guide. The references within a paper or jour- nal are usually homogenous. In order to utilize these local consistencies the model has to be adapted during the ex- traction phase, because the applied style guide is identified as the document is processed. This is not possible using the common process model. In this paper, we propose a local adaptive information extraction approach using sequence labeling methods, es- pecially Conditional Random Fields. That is a novel ex- tension of the common process model with an automatic adaption to the previously unknown style guide. We apply two stacked models that are trained offline. The first model is applied to gain information about the document’s struc- ture and to create a description of the reference notation based on the available features. The style guides differ in their characteristics of field separation and alignment. As a result, the description, also called local model, is based on different features dependent on the currently processed document. This information is used to create style-specific features which have a steady meaning for the information extraction task. However, the manifestation of these meta features differs between documents and depends on the ap- plied style guide. The new features are then added to the features of the second model helping to resolve ambiguities and to increase the extraction accuracy. As a result, the pre- sented approach achieves considerably better results than a single Conditional Random Field. The rest of the paper is structured as follows: Section 2 introduces the novel combination of methods and gives a detailed description of all parts of the process model. Then, the evaluation setting and experimental results are presented and discussed in section 3. Section 4 gives a short overview of the related work and section 5 concludes with a summary of the presented work. 151 Common Process Model Meta Phase Common Process Model Bootstrap Phase Document with references Conditional Random Field Conditional Random Field Initially extracted information Document with information References and local model Local Model Cooccurrences of information and features indicate applied style guide Use description as features Analyze document Figure 1: Overview of the applied process model with three phases: the bootstrap, local model and meta phase. 2 Method Machine learning and sequence labeling approaches are of- ten applied for reference extraction and Conditional Ran- dom Fields (CRF) are one of the most popular techniques for this task. Normally, a simple process model is used: The feature extraction identifies valuable properties in the unstructured data. They are used by a given model in or- der to extract interesting information, that is, labeling the fields of a reference. The extraction model is trained on la- beled examples in a previous phase. Therefore, the model is only adapted offline in the learning process on the global consistencies of the domain. This prevents a good gen- eralization for the global model and induces errors in the extracted information. In order to overcome this problem, the local patterns and the consistency of one document, the applied style guide in this domain respectively, need to be addressed directly for a resolution of the ambiguity. The style guide can however be identified as soon as the doc- ument is processed. 
Hence, the common process model is incompatible to an online adaption during the extraction process on the local consistency. The presented approach tries to utilize the common pro- cess model with CRFs in a novel combination. The unla- beled documents are used to identify the applied style guide directly during the extraction process. Then, this informa- tion about the homogenous notation within the current doc- ument is exploited to increase the extraction accuracy in an additional phase. The model is learnt offline, but the fea- tures it is based on are adapted online during the extraction process of each single document. Since the process can ad- just to the local consistencies, it is called local adaptive. Figure 1 provides an overview of the applied process that consists of three stages: the bootstrap, the construction of the local model and the meta phase. The purpose of the bootstrap phase is to provide the fundamental information that is needed to perceive style information for a document. We employ a common process model as it can be found in previous CRF approaches. Features are extracted from the current document and a previously learnt (base) model is applied in order to gain information. However, this is just an intermediate step required for an examination of the lo- cal information patterns. The second phase, the construc- tion of the local model, tries to create a description of the applied style guide. This is achieved by investigating the cooccurrence of information and features and the selection of features that describe the different characteristics of the style guide very well. Finally, the acquired style informa- tion is used to create special features, called meta features. These possess a different manifestation for each document. The meta phase is the final step of the process model. It is built on the common process model of conditional random fields, but uses an enhanced set of features. Additionally to the base features of the bootstrap, it also considers the new meta features that provide hints on how the information is structured in the applied style guide. In summary, the bootstrap phase takes an initial look at the reference section and handles apparent style informa- tion over to the meta phase which finally processes the ref- erence section as if the applied style guide was known. For a detailed description of the process, first the applied terminology is presented. Then, conditional random fields and the usage of features are addressed. The elements of the local model that contain the knowledge of the docu- ment’s structure build an important part of the approach and are described in detail with an example. 2.1 Terminology In the presented approach different frameworks and toolk- its are combined. In order to clarify the terminology we explain some central terms. We use a nomenclature ori- ented at the Apache UIMA framework [Ferrucci and Lally, 2004]. Definition 1 (Typesystem, Information Type). A typesys- tem is a set T, whose elements are called (annotation or information) types. As an example, we define the label type system Tlabel = {AUTHOR, BOOKTITLE, DATE, EDITOR, INSTITU- TION, JOURNAL, LOCATION, NOTE, PAGES, PUB- LISHER, TECH, TITLE, VOLUME} which contains all field labels. Furthermore, we introduce the overall typesys- tem Tall which contains all types. Definition 2 (Annotation). 
Given a text document D and a typesystemT, we define an annotation as a triplet (s, i, j) ∈ T × N × N, consisting of an information type s ∈ T and two naturals i ≤ j, indicating the begin and the end of the annotation in D. For instance, we can assign an annotation (NUM, 28, 32) to a document to state that the 152 text covered by the offsets 28 and 32 is a num- ber. Therefore, we define an appropriate typesystem Tfeat = {COMMA, CW, SW, NUM, FirstName. . .} with Tfeat ∩ Tlabel = ∅. The typesystem Tfeat contains several useful low level information types called features, e.g., COMMA indicating commas, NUM indicating numbers, CW for capitalized words, SW for lower case words and FirstName indicating first names. These annotations are automatically assigned by the feature extraction, e.g. a word list with first names is provided and for each occur- rences of an entry an annotation of the type FirstName is created. Moreover, we partition documents into pieces of atomic lexical units, called tokens, to make use of the ClearTK framework [Ogren et al., 2008] and the machine learning toolkit Mallet2 for the implementation of the CRF. Definition 3 (Token). We postulate τ ∈ Ttoken (the Token- Type) to be a type which satisfies the following conditions. • Annotations of the type τ do not cover white space characters and • all other characters are covered of exactly one anno- tation of the type τ . Annotations of the type τ are called tokens. Punctuations and special characters are put in single to- kens. Alphabetic and numerical character sequences are split into separate token sets. Definition 4 (Feature). Iff a token xt is within3 an anno- tation of a type ϕ ∈ Tfeat, we say that xt has the feature ϕ. If a token xa = (τ, 6, 7) has the feature COMMA ∈ Tfeat, then the text covered by the token is a comma. The terms feature and type are used synonymously in the fol- lowing sections. 2.2 Conditional Random Fields Conditional Random Fields (CRF) [Lafferty et al., 2001] model conditional probabilities with undirected graphs. As usual in information extraction and sequence labeling tasks, we use linear chain CRFs. That is, we take a se- quence of tokens x = (x1, . . . , xT ) as input. Given binary feature functions f1, . . . , fK and parameters λ1, . . . , λK ∈ R, we compute the conditional probability of the label se- quence y = (y1, . . . , yT ) under x by Pλ(y|x) = 1 Zx exp ( T∑ t=1 K∑ i=1 λi · fi(yt−1, yt,x, t) ) , with a normalization factor Zx = ∑ y′∈Y exp ( T∑ t=1 K∑ i=1 λi · fi(y′t−1, y′t,x, t) ) . Y is the set of all possible label sequences y′ for x. In short, a feature function fi(yt−1, yt,x, t)) can testify evidence for the token at the position t to be labeled as yt, depending on the label of it’s predecessor and the observed input sequence. The feature functions are weighted by pa- rameters λ1, . . . , λK . Hence, if fi(yt−1, yt,x, t)) = 1 and λi has a high value, then we have strong evidence for label- ing xt as yt. Accordingly, the parameters determine how 2http://mallet.cs.umass.edu 3A token (τ, a, b) is within an annotation (ϕ, x, y), iff a ≥ x and b ≤ y. we infer the labels from the information given by the fea- ture functions, i.e., we assume the label sequence that is most likely given some observation sequence. As usual in supervised machine learning, we use a learning algorithm which sets the weights to make good predictions on a train- ing set. In principle, a feature function can make complex use of the whole input sequence, the current label and the pre- decessing label. 
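To make the linear-chain model concrete, the following Python sketch computes Pλ(y|x) for a toy token sequence by brute-force enumeration of all label sequences; this is feasible only for tiny inputs (real implementations such as Mallet use dynamic programming). The label set, feature functions and weights here are invented for illustration and are not part of the paper.

    import itertools
    import math

    LABELS = ["AUTHOR", "TITLE"]   # toy label set, illustrative only

    def score(y, x, feature_functions, weights):
        """Unnormalized score: sum over positions t and features i of
        lambda_i * f_i(y_{t-1}, y_t, x, t)."""
        total = 0.0
        for t in range(len(x)):
            y_prev = y[t - 1] if t > 0 else None
            for f, lam in zip(feature_functions, weights):
                total += lam * f(y_prev, y[t], x, t)
        return total

    def probability(y, x, feature_functions, weights):
        """P(y|x) with the partition function Z_x computed by enumerating
        every possible label sequence y' (exponential; toy use only)."""
        num = math.exp(score(y, x, feature_functions, weights))
        Z = sum(math.exp(score(list(y_prime), x, feature_functions, weights))
                for y_prime in itertools.product(LABELS, repeat=len(x)))
        return num / Z

    # Two invented binary feature functions in the spirit of the paper:
    # an observation test bound to the current label (the transition
    # indicator is omitted for brevity).
    def f_capitalized_author(y_prev, y_t, x, t):
        return 1.0 if x[t].istitle() and y_t == "AUTHOR" else 0.0

    def f_quote_title(y_prev, y_t, x, t):
        return 1.0 if x[t] == '"' and y_t == "TITLE" else 0.0

    x = ["Smith", ",", '"', "Some", "paper", '"']
    y = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "TITLE"]
    print(probability(y, x, [f_capitalized_author, f_quote_title], [1.5, 2.0]))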
However, we mainly use simpler fea- ture functions, named annotation-based feature functions, which factorize into two parts. Given two labels ya, yb ∈ Tlabel, a typesystem T and a type ϕ ∈ Tfeat, an annotation based feature function has the form: fϕ,ya,yb(yt−1, yt,x, t) = 1{yt−1=ya} · 1{yt=yb} · fϕ(xt). The first part is only an indication of the label transition and ensures that we can learn separate weights for each combi- nation of labels. On the contrary, the second part is inde- pendent from the labels. fϕ(xt) just shows if the token at the position t has the feature ϕ. In different words, fϕ(xt) = { 1, if xt has the feature ϕ, 0, otherwise. Finally, we create annotation based feature func- tions fϕ,ya,yb for every ya, yb ∈ Tlabel and every type ϕ ∈ Tfeat. By example, the CRF learns a parameter λNUM,AUTHOR,YEAR for the feature function fNUM,AUTHOR,YEAR, i.e., a weight for having the NUM (number) feature and transitioning from an author field to a year field. 2.3 Local Adaptivity The local model phase is the main part of the local adaptiv- ity. It analyses the given features and the initially extracted information of the bootstrap phase and creates a descrip- tion of the characteristics of the applied style guide. This description is then projected as features in order to be use- ful for the CRF in the meta phase. Therefore, the local model consists of two major steps: the creation of the de- scription and the projection of features. Overall, a local model can be seen as a representation of specific knowl- edge of each single document’s structure. There are vari- ous means to describe these local patterns of a document. The rule-based approach for local adaptivity [Kluegl et al., 2010] has shown that two characteristics are describing the applied style guide sufficiently for a considerable increase of accuracy. field separation One important consistency inside a ref- erence section is the way how fields are separated. For instance, one writer always ends the author lists with a period, another writer may use a colon instead. If such a field separator is once determined with the help of other references, then it can help solving am- biguos cases, for example, in the case when one of the first tokens of the title also contains a colon. For every label type ϕlabel ∈ Tlabel we try to detect fea- tures which indicate the begin or the end ofϕlabel fields in a document. These additional features BEGINϕlabel and ENDϕlabel ∈ Tmeta-feat then indicate the document specific seperators for the meta phase. For instance, if a token xt has the feature LPAREN and we have recognized that date fields begin with a left parenthe- sis in this document, then we assign an annotation of the type BEGINDATE ∈ Tmeta-feat to xt to state that 153 we have evidence for the begin of a date field. In addition to these two meta features, we also intro- duce two specialized meta features BEGINtϕlabel and ENDtϕlabel ∈ Tmeta-feat that restrict the projection of the feature dependent on the initally extracted informa- tion. field sequence Style guides define not only the way of field separation. The sequence and alignment of the fields normally does not change within a reference section. Although some fields are optional and may be skipped by the author, information about the oc- curring sequences can resolve ambiguities and be of assistance in classification. As a simple example, we refer to the date field of the reference. Normally, the date is located either directly after the author or near the end of the reference. 
If no features indicate a date in the current reference, then information about the field before and the field after the dates of the remain- ing references hepls to find the date. For every label type ϕlabel ∈ Tlabel we try to detect fields that are nor- mally located before and after the fields with the la- bel ϕlabel. These additional features BEFOREϕlabel and AFTERϕlabel ∈ Tmeta-feat then indicate the inherent se- quences of the reference section for the meta phase. For instance, if the analysis of the extracted informa- tion is confident that the field date is always followed by the pages field, then we assign an annotation of the type AFTERDATE ∈ Tmeta-feat to each token that was labeled with type ϕPAGES. Summarizing, annotations of the types t ∈ Tmeta-feat are used to enrich the feature functions of the meta phase. In the following, we describe how these types are determined with the use of the given annotation-based features and the initially extracted information. First, an observed meta feature is created for the selec- tion of types that are suitable for a meta feature. Definition 5 (Observed Meta Feature). ϕ∗meta ∈ Tmeta-observed is defined as the manifestation of the corresponding meta feature ϕmeta ∈ Tmeta-feat in the case that the information was extracted perfectly. In other words, the observed meta feature ϕ∗meta ∈ Tmeta-observed is automatically assigned to those tokens of each reference that are located exact at the positions indi- cated by the meta feature ϕmeta ∈ Tmeta-feat. The observed meta features BEGIN∗AUTHOR, for example, is assigned to the first token of the initially extracted author field and the observed meta features END∗AUTHOR is assigned to the last token of the author field. The observed meta features will only be utilized to de- termine suitable types for the meta features. All available types of features ϕ ∈ Tfeat are compared to the observed meta features ϕ∗meta using a similarity measure. It is useful to consider the shape of the tokens and their properties to be the outcome of a stochastic event. From this point of view, fϕ (cf. section 2.2) is a random variable and p(fϕ=1) rep- resents the probability that a token has the feature ϕ. Addi- tionally, p(fϕ1=1, fϕ2=1) is a joint probability, indicating how likely a token has both the feature ϕ1 and the feature ϕ2. By example, p(fNUM=1, fCW=1) = 0 since tokens cover either numbers or capitalized words. The mutual information has shown to be a sound simi- larity measure. Between two random variables X and Y MI(X,Y ) = ∑ x∈X ∑ y∈Y p(x, y) · log ( p(x, y) pX(x) · pY (y) ) measures how much information X and Y share. It cov- ers all possible outcomes of X and Y . However, since we are only interested in the coincidence of a feature ϕ and the observed meta features ϕ∗meta and not, for example, in the absence of a feature, the sum of all different values or occurrences of the features is removed. Hence, the mu- tual information can be reduced to a weighted pointwise mutual information. The probability distributions are then estimated by the observed frequencies pˆ in the reference section: α(ϕ,ϕ∗meta) = pˆ(fϕ=1, fϕ∗meta=1)·log pˆ(fϕ=1, fϕ∗meta=1) pˆ(fϕ=1) · pˆ(fϕ∗meta=1) High values of α(ϕ,ϕ∗meta) indicate that the type ϕ is suitable to describe the meta feature ϕmeta. The weighted pointwise mutual information is motivated with the fact the rare occurrences of an information and a feature aren’t rep- resentative for the complete reference section. 
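The weighted pointwise mutual information α can be computed directly from co-occurrence counts over the reference section. A minimal sketch, assuming plain counts as input (tokens carrying both the feature and the observed meta feature, the two marginal counts, and the total number of tokens) and the natural logarithm; with the counts from the example in Section 2.4 it reproduces the reported ratings.

    import math

    def alpha(n_both, n_feat, n_meta, n_tokens):
        """Weighted pointwise mutual information between a feature phi and
        an observed meta feature phi*_meta, estimated from token counts:
        alpha = p(f_phi=1, f_phi*=1)
                * log( p(f_phi=1, f_phi*=1) / (p(f_phi=1) * p(f_phi*=1)) )."""
        p_both = n_both / n_tokens
        p_feat = n_feat / n_tokens
        p_meta = n_meta / n_tokens
        if p_both == 0.0:
            return 0.0   # no co-occurrence, no evidence
        return p_both * math.log(p_both / (p_feat * p_meta))

    # Counts from the example in Section 2.4 (799 tokens, 19 observed
    # author-end positions): PERIOD co-occurs 18 times out of 143
    # occurrences, PeriodSep 18 times out of 63.
    print(alpha(18, 143, 19, 799))   # approx. 0.0375
    print(alpha(18,  63, 19, 799))   # approx. 0.0560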
After applying the formula on all available features, we gain a sorted list of rated candidates for each meta fea- ture. For the presented work no conjunctions of features for the manifestation of a meta feature are utilized. How- ever, a meta feature cannot always be described by a single feature, but requires sometimes a disjunction of features. Therefore, instead of only using the highest rated feature for the description of the meta feature, each feature is con- sulted that fulfills two conditions: its α rating exceeds a given threshold β and the annotations of the feature are disjoint to the other selected features whereas higher rated features are preferred. On the one hand, some rare applied style guides are able to create different separators for a field. But also with a strict style guide applied, the absence of some information can require a description of a meta fea- ture with several features. If the date contains an informa- tion about the month in fifty percent of its occurrences, then the start separator of the date is either a number or a word indicating a month name. Hence, the description of the be- gin of the date would be described best with two features. For that reason, several feature are allowed for the manifes- tation, but only if they are not redundant, i.e. are disjoint to the already selected features of higher rating. For the computation of disjoint features the joint probability with the observed frequencies is reused. Two features ϕ1 and ϕ2 are considered disjoint iff pˆ(fϕ1 = 1, fϕ2 = 1) ≈ 0. A minimal margin was applied since we assume a fallible feature extraction that erroneously assigns a feature on rare occasions. The threshold is applied because of the weighted point- wise mutual information. Features that only occur once or twice in a document are not confident enough for the de- scription of the local model even if they are disjoint with the already selected features. This selection of disjoint fea- tures does not need to be applied for the sequences of the fields due to the characteristics of the reference parsing do- main where the labels build a disjoint partition of the com- plete reference. Only the threshold is used to filter rare sequences of fields. In order to use the meta features in the common process model, the types of annotations are projected by provid- ing an annotation-based feature function. Additionally to 154 the already described feature function, a specialized feature function fϕtmeta is added for the separation of the fields. fϕtmeta(xt) =  1, if xt has the feature ϕmeta and |t− o| is minimal with fϕ∗meta(xo) = 1, 0, otherwise. Here, a token possesses this meta feature only if it is located nearest to the observed meta feature ϕ∗meta. The projection is of course limited to the currently considered reference. The combination of both strategies for the pro- jection of separators enforces the reuse and simultaneously the correction of the initially extracted information. The complete process of the creation of the local model and its projection as features is summarized in algorithm 1: Algorithm 1 Local Model Phase for all ϕmeta ∈ Tmeta-feat do l← new list for all ϕ ∈ Tfeat do if (∀ϕ′ ∈ l : pˆ(fϕ = 1, fϕ′ = 1) ≈ 0) ∧ α(ϕ,ϕ∗meta) > β then add ϕ to l project l as manifestation of ϕmeta 2.4 Example The selection of features and their projection as meta fea- tures are illustrated with a simplified example focusing on the meta feature ENDAUTHOR. 
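Algorithm 1 can be rendered compactly in Python, reusing an α rating computed as above. The token representation (a set of feature names per token), the disjointness margin and the explicit ordering of candidates by rating are assumptions made for illustration; the authors' implementation operates on UIMA annotations. The toy call mirrors the situation of the example that follows, where PeriodSep is selected for the end of the author field and the non-disjoint PERIOD feature is skipped.

    def joint_prob(tokens, f1, f2):
        """Estimated probability that a token carries both features."""
        return sum(1 for t in tokens if f1 in t and f2 in t) / len(tokens)

    def select_manifestation(tokens, candidates, ratings, beta, margin=0.005):
        """Algorithm 1 (local model phase) for one meta feature: walk the
        candidate features in descending alpha rating and keep a feature
        if it exceeds the threshold beta and is (approximately) disjoint
        from every feature selected so far."""
        selected = []
        for phi in sorted(candidates, key=ratings.get, reverse=True):
            if ratings[phi] <= beta:
                continue
            if all(joint_prob(tokens, phi, other) <= margin for other in selected):
                selected.append(phi)
        return selected   # the manifestation projected as the meta feature

    # Hypothetical usage: tokens are sets of feature names, ratings come
    # from alpha(phi, END*_AUTHOR) computed over the reference section.
    tokens = [{"PERIOD", "PeriodSep"}, {"CW"}, {"COMMA"}, {"PERIOD"}]
    ratings = {"PeriodSep": 0.056, "PERIOD": 0.0375, "COMMA": 0.001}
    print(select_manifestation(tokens, ratings.keys(), ratings, beta=0.01))
    # -> ['PeriodSep'] (PERIOD is not disjoint from PeriodSep and is skipped)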
The input of the presented approach is a reference section with 20 references and over- all 799 tokens that have been labeled in the bootstrap phase. Figure 2 contains two references of the reference section. The first row shows the begin of the reference, whereas the labels assigned by the bootstrap phase are depicted in the second row. Obviously, the CRF falsely labeled the tokens “Exokernel:” as author in the first reference. In the next four rows, a selection of features are added. PM stands for all punctuation marks, PERIOD for periods, COLON for colons and PeriodSep for periods that are not part of abbre- viations or name initials. Finally, the last two rows contain the computed meta features ENDAUTHOR and ENDtAUTHOR. Applying the similarity measure on the given features re- sults in a rating how good the feature describes the end of the author. The values are given for two suitable features: α(ϕPERIOD,END∗AUTHOR) = 18 799 · log 18 799 143 799 · 19799 = 0.0375 α(ϕPeriodSep,END∗AUTHOR) = 18 799 · log 18 799 63 799 · 19799 = 0.0560 Since a token with the feature ϕPeriodSep always has the feature ϕPERIOD, both features are not disjoint. The local model now states that the end of the author is at best de- scribed by a single feature. Hence, ϕPeriodSep is assigned to ENDAUTHOR in this document and the feature function fENDAUTHOR(xt) = 1, iff xt has the feature ϕPeriodSep. In a different reference section, for example, with another style guide applied fENDAUTHOR(xt) = 1, iff xt has the feature ϕCOMMA or ϕCOLON. The meta feature ENDtAUTHOR is con- sequently only assigned to the token that is nearest to the observed meta feature END∗AUTHOR. That is the 20th to- ken in the first reference and the 8th token in the second reference. The CRF of the meta phase now has access to additional features of high quality resulting in an increased accuracy. 3 Experimental Study We have evaluated the presented process model, the idea of the local adaptivity and its novel combination of state of the art methods in an experimental study. First, the applied data sets, features and settings of the study are described. Then, the results of the evaluation are presented and discussed. 3.1 Data sets The labeled data sets CORA (500 references), CITESEERX (200 references) and FLUX-CIM (300 references, CS do- main)4 build the source of the evaluation data set. All three data sets consist of a listing of single references without the context of the original reference section. Therefore, these data sets are not directly applicable for the presented ap- proach. A simple script was developed in order to recon- struct reference sections as they would occur in real pub- lications using only references originated in the available data sets. Due to the simplicity of the assignment script and the distribution of the reference styles in the dataset a considerable amount of references could not be assigned to a paper. The resulting data set DPaper contains 28 docu- ments and overall 452 references and resembles reference sections of real papers. Therefore, our data set can be con- sidered more natural. Some erroneous labels and defects due to obvious differences in the annotation guide lines of the three original data sets were corrected. DPaper is ran- domly splitted into three folds for the evaluation. DTrainPaper (315 references, two folds) is used for the training and DTestPaper (137 references) for testing. 
Additionally, DRest contains 350 randomly selected references of the remain- ing references of the original data sets. 3.2 Features Similar to previous studies with CRFs we use features in- dicating the capitalization, the length of tokens, numbers, whitespaces on the left and on the right of the observed to- ken, the relative position inside the reference string, n-gram prefixes, n-gram suffixes, as well as the covered text of the token and the covered text of tokens on the left and on the right of the observed token. These features are integrated as normal feature functions. Additionally, annotation-based feature functions, previously denoted by Tfeat, for token classes and combinations of tokens are applied. To these belong different usages of punctuations, regular expres- sions for URLs and simple combinations of features, for example a first name and a capitalized word. Dictionaries for first names, stop words, locations, keywords, journals and publishers were created and added to the annotation- based feature functions. Overall, the applied features are comparable to previously published approaches. 3.3 Settings Overall, three CRFs are trained for the experimental study: BOOTSTRAP, META and COMPARE. All of them were re- lying on the same features described in section 3.2. The overall process is build upon UIMA and the ClearTK framework. The machine learning toolkit Mallet is used for an implementation of the CRFs. The presented process model contains two CRFs. The CRF of the bootstrap phase (BOOTSTRAP) represents a simple model for the extraction task. It is trained on the data set DRest and 250 iteration 4all three data sets are available online, e.g., at http:// wing.comp.nus.edu.sg/parsCit/ 155 D . R . Engler , M . F . Kaashoek , J . W . O ’ Toole . Exokernel : An Operating System Architecture for Application Level . . . Bootstrap result A A A A A A A A A A A A A A A A A A A A A A T T T T T T T END∗AUTHOR x PM x x x x x x x x x x PERIOD x x x x x x COLON x PeriodSep x ENDAUTHOR x ENDtAUTHOR x S . Seneff and J . Polifroni . A new restaurant guide conversational system : Issues in rapid prototyping for specialized domains . In Pr. . . Bootstrap result A A A A A A A A T T T T T T T T T T T T T T T B B END∗AUTHOR x PM x x x x x PERIOD x x x x COLON x PeriodSep x x ENDAUTHOR x x ENDtAUTHOR x Figure 2: Two exemplary references with the initially extracted information, the some given features and the assigned meta features. The occurrence of a feature is indicated with “x” and the label of a token is denoted with first letter of the label. were applied. The CRF of the meta phase (META) has ac- cess to the additional meta features Tmeta-feat. The model is trained on the data set DTrainPaper with unlimited iterations and the threshold β for the meta features is set to 0.01. An additional CRF (COMPARE) is also trained on the data set DTrainPaper with the settings of META in order to compare the increase of accuracy due to the meta features. A gaussian variance of 10 is used and the markov order is set to one for all CRFs. The two CRFs BOOTSTRAP and META are trained on two different data sets. The meta features need to be created on real results and not on the almost perfectly labeled data of a training process. If the meta features are only created for correct results in the meta phase, then the advantages for the contextual reuse and correction of the initially extracted information are forfeited. 
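The base features of Section 3.2 can be summarised as a schematic token-level extractor. This is a simplified Python sketch with placeholder dictionaries and invented feature names; the actual system realises these features as UIMA/ClearTK annotations.

    import re

    FIRST_NAMES = {"john", "peter", "maria"}            # placeholder dictionaries
    JOURNAL_WORDS = {"journal", "transactions", "proceedings"}

    def token_features(tokens, t):
        """Observation features for the token at position t, in the spirit of
        Section 3.2: shape, length, relative position, affixes, context words,
        URL pattern and dictionary lookups."""
        tok = tokens[t]
        feats = {
            "CW": tok[:1].isupper(),
            "SW": tok.islower(),
            "NUM": tok.isdigit(),
            "LEN": len(tok),
            "REL_POS": round(t / max(len(tokens) - 1, 1), 2),
            "URL": bool(re.match(r"https?://", tok)),
            "FirstName": tok.lower() in FIRST_NAMES,
            "JournalWord": tok.lower() in JOURNAL_WORDS,
            "TEXT": tok.lower(),
            "PREV": tokens[t - 1].lower() if t > 0 else "<BOS>",
            "NEXT": tokens[t + 1].lower() if t + 1 < len(tokens) else "<EOS>",
        }
        # n-gram prefixes and suffixes (n = 1..3)
        for n in range(1, 4):
            feats[f"PREFIX{n}"] = tok[:n]
            feats[f"SUFFIX{n}"] = tok[-n:]
        return feats

    print(token_features(["S", ".", "Seneff", "and", "J", "."], 2))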
The training of the meta phase needs to rely on realistic and therefore not perfect results for a suitable integration of the meta fea- tures. 3.4 Performance Measure The performance of the presented approach is measured with commonly used methods of the domain. For a field label l ∈ Tlabel, let tp(l) be the number of true positive classified tokens for the label l and define fn(l) and fp(l) respectively for false negatives and false positives. Since punctuations contain no information in this domain, only alpha-numeric tokens are considered. Precision, recall, F1 and average F1 are computed by precision(l) = tp(l) tp(l) + fp(l) , recall(l) = tp(l) tp(l) + fn(l) , F1(l) = 2 · precision(l) · recall(l) precision(l) + recall(l) , Average = 1 |Tlabel| ∑ l∈Tlabel F1(l). The instance accuracy measures how many references have been perfectly classified Instance = #references without an error #all references . 3.5 Results Table 2 contains the results of the experimental study. The second column lists the true positives tp(l) of each Table 1: Results of the evaluation of the three CRFs. The average F1 is computed without the editor and note field. BOOTSTRAP COMPARE META tp CRF CRF CRF Author 821 99.0 99.1 99.5 Booktitle 670 94.8 95.1 97.5 Date 200 95.4 98.0 97.8 (Editor) 7 0/100 0/100 0/100 Institution 86 32.7 97.1 95.1 Journal 186 96.8 89.0 98.1 Location 51 86.4 91.7 92.6 (Note) 3 0/100 0/100 0/100 Pages 222 90.7 97.5 97.7 Publisher 33 87.5 98.5 93.7 Tech 75 37.0 87.4 94.4 Title 1064 97.0 96.6 98.3 Volume 84 98.8 85.1 98.8 Average∗ 83.3 94.1 96.7 Instance 75.9 78.8 85.4 field l ∈ Tlabel and the remaining columns contain the F1 scores of the three evaluated CRFs BOOTSTRAP, COM- PARE and META tested on the data set DTestPaper. The aver- age F1 and the instance accuracy are added in the last two rows. As mentioned before, the amount of true positives is much smaller than the number of tokens since only alpha- numeric tokens are considered in the evaluation. The infor- mation of a date field, for example, is independent of sur- rounding parentheses or punctuation marks. There are no values added for the editor and note fields. Both fields con- sist only of a few tokens and achieved an F1 score of 100.0 in most of the evaluation runs. However, a F1 score of 0.0 was also sometimes obtained dependent on the distribu- tion of examples in the two data sets DTrainPaper and D Test Paper. Therefore, both fields are not considered in the calculation of the average F1. The CRF META achieved an instance accuracy of 85.4% and an average F1 score of 96.7%. Compared to the CRF BOOTSTRAP, the error of the instance accuracy was re- duced by 39.4% and the error of the average F1 by 80.1%. Compared to the CRF COMPARE that was trained on the same data set, the error of the instance accuracy was re- duced by 31.1% and the error of the average F1 by 44.1%. 156 A closer look at the single fields of META and COMPARE reveals that the meta phase was able to improve eight fields and worsened the results of only three fields. Two of these fields, the location and the publisher, contain less true posi- tives than the other fields and strongly depend on dictionar- ies. The difference in the date fields is caused by only one single misclassified token. Overall, META created 3435 true positives, whereas COMPARE classified 3376 true pos- itives. 
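The measures of Section 3.4 follow directly from token-level gold and predicted labels. A small sketch under the assumption that only alpha-numeric tokens are passed in and that references are given as pairs of gold and predicted label sequences (a hypothetical data layout, not the authors' evaluation code):

    from collections import Counter

    def field_scores(gold, pred):
        """Token-level precision, recall and F1 per field label."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
        scores = {}
        for label in set(gold) | set(pred):
            prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
            rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            scores[label] = (prec, rec, f1)
        return scores

    def average_f1(scores, exclude=("EDITOR", "NOTE")):
        """Average F1 over all field labels, without the editor and note fields."""
        f1s = [f1 for label, (_, _, f1) in scores.items() if label not in exclude]
        return sum(f1s) / len(f1s)

    def instance_accuracy(references):
        """Share of references whose tokens are all labeled correctly;
        `references` is a list of (gold_sequence, pred_sequence) pairs."""
        perfect = sum(1 for g, p in references if g == p)
        return perfect / len(references)

    gold = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "DATE"]
    pred = ["AUTHOR", "TITLE", "TITLE", "TITLE", "DATE"]
    print(field_scores(gold, pred)["TITLE"])
    print(instance_accuracy([(gold, pred), (gold, gold)]))   # 0.5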
3.6 Discussion The combination of two CRFs and analysis of the local consistencies achieves better results than a single CRF, the state of the art method in the domain of reference extraction. Although the result of the bootstrap phase is mediocre, the local model and the projection of its knowl- edge are robust enough to create valuable meta features. Hence, the meta phase is able to outperform the commonly applied process. A closer look at the extraction results re- veals that the presented approach still trails behind its own potential. The combination of features and meta features often create a situation where a correct classification is ob- vious. Many false positives and false negatives should not occur with the available features at hand. The boundaries of the author field, for example, are perfectly defined by the created separator features, but the CRF still labels some to- kens of the author erroneously. Therefore, the presented approach still provides enormous potential for improve- ments. A different information extraction technique might integrate the knowledge about the local consistencies bet- ter in the meta phase than CRFs. Furthermore, a direct combination of both CRFs in the learning process or an improved projection of the meta features can improve our process model. The effect of the presented approach on unknown style guides should be investigated in detail. The test data set DTestPaper already contains references with style guides that aren’t present in the training data set DTrainPaper. However, a test data set only containing unknown styles can illustrate the advantages of our approach compared to the common process model furthermore. A comparison to the results of related publications is problematic. Although the instances of the applied data set were used in previous evaluations, the results can hardly be compared as three different data sets were mixed and some references are left out. 4 Related Work The extraction of references is an active field of research. Techniques based on Hidden Markov Models, Maximum Entropy Models and Support Vector Machines and sev- eral approaches using CRF were published. Peng and Mc- Callum [Peng and McCallum, 2004] established CRF as the state of the art approach for the reference extraction task. They used 350 references of the CORA data sets for training and 150 references for the evaluation. Councill et al. [Councill et al., 2008] applied CRF in their ParseCit system on the CORA data set and evaluated their approach with a 10 fold cross evaluation. In addition, they evalu- ated also the data sets CITESEERX and FLUX-CIM. Both approaches achieved an average F1 score of ≈92% and a modified average F1 score of ≈93%. Table 2 contains de- tails of their evaluation results. Ng [Ng, 2004] has built the first version of ParsCit. It was based on the Maximum Entropy paradigm and the ac- curacy was worse compared to the performance of the cur- Table 2: Results of related publications. In addition to the average F1 score, the Average* is computed without the editor and note fields. [Peng and McCallum, 2004] [Councill et al., 2008] CORA CORA CITESEERX FLUX-CIM Author 99.4 99 96 99 Booktitle 93.7 93 81 97 Date 98.9 99 94 97 Editor 87.7 86 67 - Institution 94.0 89 74 - Journal 91.3 91 83 89 Location 87.2 93 85 89 Note 80.8 65 29 - Pages 98.6 98 91 97 Publisher 76.1 92 81 85 Tech 86.7 86 73 - Title 98.3 97 93 96 Volume 97.8 96 87 92 Average 91.5 91.1 79.5 93.4 Average* 92.9 93.9 85.3 93.4 Instance 77.3 - - - rent system. 
However, Ng identified different categories of flaws in his extraction process and applied an additional phase for their correction. One step of these repairs pro- cessed repeating fields of a single reference, e.g., the oc- currence of multiple titles. For the correction of this error, he created a list of all sequences of fields within the ex- tracted references. Then, the multiple fields were resolved using the sequence that occurred most. With all repairs ap- plied, the instance accuracy was increased from 45.6% to 60.8% on the CORA dataset. Compared to our approach, Ng applied only one specialized repair on the sequences in a post processing step in order to correct an error that can be prevented by applying a CRF instead. Furthermore, the evaluation with the CORA data sets itself prevents any statements about improvements by the usage of local con- sistencies. There are also some knowledge engineering approaches. Cortez et al. [Cortez et al., 2007] evaluated their unsuper- vised lexicon-based approach on data sets of the domains health science and computer science. An automatically generated, domain-specific knowledge base is applied af- ter a chunking of each text segment in order to identify the fields. Day et al. [Day et al., 2007] created templates for well-known reference styles and used them to extract the fields in journal articles. They achieved an average accu- racy of 92.4% on complete fields. Kluegl et al. [Kluegl et al., 2010] proposed a local adap- tive extraction of references with handcrafted transforma- tion rules. A simple model extracts initial fields. The lo- cal patterns of these information are stored in a short term memory and are used to create a description of the style guide of the reference section. Then, transformation rules match on a meta level and provide an automatic adaption on the internal previously unknown consistency of the doc- ument. The approach is evaluated only on the fields author, title, editor and date. They achieved an average F1 score of 97.6% on the complete CORA data set and an average F1 score of 99.7% on natural reference sections based on the CORA data set. This rule-based work on local adaptivity is the basis of our approach. We continued and extended the previous work by two major points: We exchanged the la- 157 borious knowledge engineering in all stages and combined an automatic creation of the local model with state of the art techniques. Only the definition of suitable features in the feature extraction requires manual effort of a knowledge engineer. Additionally, all 13 possible fields of a reference are extracted and evaluated instead of the four fields. Our process model can be compared to stacked general- ization [Wolpert, 1992] where a meta model is learnt using the output of initial, basic models. Sigletos et al. [Sigletos et al., 2003] among others adapted this meta-learning ap- proach of classification for the information extraction task. In contrast to this approach, we apply only a single base model and are not trying to compose a model directly on the initially extracted information. Instead, we are creating new features in order to exploit the patterns of the unlabeled data during the extraction task. Recently, advances have been made in joint infer- ence [Poon and Domingos, 2007] that combine different steps of the information extraction task. The presented work possesses some simple peculiarities of that approach, e.g., features transfer inference information in one direc- tion. 
There is also work published on extending or param- eterizing the features of discriminative models. Stewart et al. [Stewart et al., 2008], for example, are learning flexible features for the extraction of references. 5 Conclusions We have presented a novel combination of two CRFs ap- plied on the local adaptive extraction of references. The initial results of the first CRF are exploited to gain infor- mation about the local consistencies. Then, the second CRF is automatically adapted to the previously unknown style guide. This is achieved by changing the manifesta- tion of its features dependent on the currently processed reference section. The results indicate a considerable im- provement towards the commonly applied process model. We achieved an average F1 score of 96.7% and an instance accuracy of 85.4% on the test data set. Several directions merit a further exploration of the pre- sented work. A combined inference of both CRFs and a local model beyond a description by single features might exploit the full potential of the approach. Combinations and sequences of features are able to describe the local consistencies even if the features extraction provides only simple features. Moving further in this direction leads to a knowledge-based local model that is created, e.g., us- ing subgroup discovery techniques. Comparing the pre- sented approach to previous work [Kluegl et al., 2010], a transformation-based correction of the initially extracted information appears to be very suitable for this task and is able to integrate the local model more straightforward. The knowledge engineering effort can be avoided by learn- ing several binary classificators. Support Vector Machines, for example, can be trained to define transformations of the given information dependent on the local model. Our ap- proach is not restricted to the extraction of references. Even greater improvements compared to the state of the art meth- ods are possible in other domains like curriculum vitae and medical patient records (cf. [Kluegl et al., 2009]). References [Cortez et al., 2007] Eli Cortez, Altigran S. da Silva, Mar- cos André Gonçalves, Filipe Mesquita, and Edleno S. de Moura. FLUXCiM: Flexible Unsupervised Extrac- tion of Citation Metadata. In JCDL ’07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital li- braries, pages 215–224, New York, USA, 2007. [Councill et al., 2008] Isaac Councill, C Lee Giles, and Min-Yen Kan. ParsCit: an Open-source CRF Refer- ence String Parsing Package. In Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco, 2008. ELRA. [Day et al., 2007] Min-Yuh Day, Richard Tzong-Han Tsai, Cheng-Lung Sung, Chiu-Chen Hsieh, Cheng- Wei Lee, Shih-Hung Wu, Kun-Pin Wu, Chorng-Shyong Ong, and Wen-Lian Hsu. Reference Metadata Ex- traction using a Hierarchical Knowledge Representation Framework. Decis. Support Syst., 43(1):152–167, 2007. [Ferrucci and Lally, 2004] David Ferrucci and Adam Lally. UIMA: An architectural approach to un- structured information processing in the corporate research environment. Natural Language Engineering, 10(3/4):327–348, 2004. [Kluegl et al., 2009] Peter Kluegl, Martin Atzmueller, and Frank Puppe. Meta-level Information Extraction. In Bärbel Mertsching, Marcus Hund, and Muhammad Za- heer Aziz, editors, KI, volume 5803 of Lecture Notes in Computer Science, pages 233–240. Springer, 2009. [Kluegl et al., 2010] Peter Kluegl, Andreas Hotho, and Frank Puppe. Local Adaptive Extraction of References. 
In 33rd Annual German Conference on Artificial Intelli- gence (KI 2010). Springer, 2010. accepted. [Lafferty et al., 2001] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proc. 18th International Conf. on Machine Learning, pages 282–289, 2001. [Ng, 2004] Yong Kiat Ng. Citation Parsing using Maxi- mum Entropy and Repairs. Undergraduate thesis, Na- tional University of Singapore, 2004. [Ogren et al., 2008] Philip V. Ogren, Philipp G. Wetzler, and Steven Bethard. ClearTK: A UIMA toolkit for sta- tistical natural language processing. In UIMA for NLP Workshop at LREC, 2008. [Peng and McCallum, 2004] Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Re- search Papers using Conditional Random Fields. In HLT-NAACL, pages 329–336, 2004. [Poon and Domingos, 2007] Hoifung Poon and Pedro Domingos. Joint Inference in Information Extraction. In AAAI’07: Proc. of the 22nd National Conference on Ar- tificial intelligence, pages 913–918. AAAI Press, 2007. [Sigletos et al., 2003] Georgios Sigletos, Georgios Paliouras, Constantine D. Spyropoulos, and Takis Stamatopoulos. Meta-Learning beyond Classification: A Framework for Information Extraction from the Web. In Proc. of the Workshop on Adaptive Text Extraction and Mining. The 14th ECML and the 7th Euro. Conf. on PPKD, 2003. [Stewart et al., 2008] Liam Stewart, Xuming He, and Richard S. Zemel. Learning Flexible Features for Con- ditional Random Fields. IEEE Trans. Pattern Anal. Mach. Intell., 30(8):1415–1426, 2008. [Wolpert, 1992] David H. Wolpert. Stacked Generaliza- tion. Neural Networks, 5:241–259, 1992. 158 Towards Understanding the Changing Web: Mining the Dynamics of Linked-Data Sources and Entities [work in progress] Ju¨rgen Umbrich and Marcel Karnstedt Digital Enterprise Research Institute (DERI) National University of Ireland, Galway, Ireland firstname.lastname@deri.org Sebastian Land Rapid-I GmbH land@rapid-i.com Abstract A huge amount of content found on the Web is dynamic by its nature, particularly with the rise of Web 2.0 and beyond. This is of special interest for the Semantic Web community, not only but particularly regarding resources on the Linked Open Data (LOD) Web. However, the dataset dynamics of the LOD graph are hardly explored so far. Existing approaches from the traditional HTML Web are not sufficient to mine and discover the dynamics with satisfying accu- racy and efficiency, as they do not consider the special characteristics of LOD. We present first initial results on this topic and discuss future steps. First results are obtained by mining the groups of URIs with similar change frequencies and by applying time series techniques as well as clustering techniques. 1 Motivation At the time of writing, we can find several hundred datasets published as Linked Open Data (LOD) on the Web. The LOD cloud contains up to several billion Resource Descrip- tion Framework (RDF) triples of machine readable infor- mation, describing real world entities (resources) and the relations among them. This data forms a huge directed and labelled graph (resources are nodes, relations are edges), which can be accessed, traversed and consumed by humans and intelligent software agents. Currently, the so formed LOD graph (see Figure 1) contains billions of nodes but only a couple of million edges 1. 
However, experts in the field commonly agree that this gigantic graph will even grow faster in the future and become more dense by adding more labelled links between the nodes. New data and links between the data are added either by humans or by ma- chines (e.g., from data converters like Any232 or by content management systems like Drupal 7). The LOD research community focuses on the different issues in identifying, linking and publishing this data. A still nearly unexplored field are the dynamics of the contained data sets. With the term dataset dynamics we refer to content and interlinking changes in the Linked Data graph. Besides content changes, the dynamics of nodes (i.e., entities) and links are of particular interest. These can be roughly categorised as follows: 1http://esw.w3.org/TaskForces/ CommunityProjects/LinkingOpenData/DataSets/ [Statistics|LinkStatistics] 2http://any23.org/ Figure 1: The LOD cloud in mid 2009, courtesy of Cyga- niak and Jentzsch. • Dynamics of resources: New nodes are added and old nodes are removed; • Dynamics between resources: New relations are added (in the form of new triples) and old ones might be removed; In a first attempt [Umbrich et al., 2010] we monitored the change frequency of Linked Data resources and revealed that most of them show the same behavior as HTML Web documents. However, to the best of our knowledge, there is no published work describing attempts to apply techniques from other mature research areas dealing with the analysis and investigation of dataset dynamics. In this work, we take a closer look at how methods from the data mining and machine learning community can be used for this task. We identify suitable and promising techniques, discuss the benefit we expect from their application and present some preliminary results indicating the usefulness. The motivating use-case for our study of dataset dynam- ics is to improve concurrent work on an efficient system for performing live queries over the LOD Web [Harth et al., 2010]. The challenge is to decide which (sub-)queries can be run against a cache or data summary and which (sub-)queries have to be executed live over the Web con- tent to guarantee up-to-date results. However, there are several other tasks related to managing Web data that can benefit from a deeper understanding of dataset dynamics: Web crawling and caching [Cho and Garcia-Molina, 2003], maintaining link integrity [Haslhofer and Popitsch, 2009], serving of continuous queries [Pandey et al., 2003]. Regarding the above sketched use cases, there are some main questions we have to focus on. These are: • What classes of dynamics can and should we distin- guish? 159 • What actual methods are suited to classify LOD re- sources with respect to their dynamics? • On what features should these methods be based on? • How can we efficiently predict future changes and the types of changes? To start the investigation of these questions we firstly focus on one of the underlying main questions, which we see as the most interesting starting point: What correlations between resources and their dynamics can we identify using methods from data mining, machine learning and graph analysis? The specific correlations we expect and therefore plan to investigate are correlations between the dynamics of the re- sources and: 1. their domain names (i.e., their origin) 2. the used vocabulary (i.e., RDF predicates and classes) 3. 
their linkage (i.e., if one resource changes how likely is it that resources linked to it change as well) To achieve this, we first started to continuously monitor a large set of LOD resources. Based on the observed dynam- ics, we can cluster these resources and apply correlation analysis between sets of resources as well as between re- sources and their features. In this work, we present initial results on this analysis and discuss future steps. 2 Dataset Dynamics on the LOD Web In this section, we first introduce the data model of RDF and how it contributes to the LOD Web. Secondly, we dis- cuss how our problem of studying dataset dynamics can be mapped to the problem of the dynamics of nodes in a labelled directed graph. Finally, we elaborate on the im- portance of data mining to reveal new insights into the dy- namics of such a graph. 2.1 RDF and Linked Data The Resource Description Framework [Manola and Miller, 2004] defines a data format for publishing schema-less data on the Web in the form of (subject, predicate, object) triples. These triples are composed of unique identifiers (URI references), literals (e.g., strings or other data values), and local identifiers called blank nodes as follows: Definition 1. (RDF Triple, RDF Term, RDF Graph) Given a set of URI references U , a set of blank nodes B, and a set of literals L, a triple (s, p, o) ∈ (U ∪B)×U × (U ∪B∪L) is called an RDF triple, We call elements of U ∪B∪L RDF terms. Sets of RDF triples are called RDF graphs. The notion of graph stems from the fact that RDF triples may be viewed as labelled edges connecting subjects and objects. RDF published on the Web according to the following principles is called Linked Data [Berners-Lee, 2006]: 1) URIs are used as names for things: in contrast to the HTML Web where URIs are used to denote con- tent (documents, images), on the Semantic Web URIs can denote entities such as people or cities; 2) URIs should be dereferenceable using the Hypertext Transfer Protocol (HTTP): a user agent should be able to perform HTTP GET operations on the URI; 3) Useful content in RDF should be provided at these URIs: a Web server should return data en- coded in one of the various RDF serialisations; 4) Include links to other URIs for discovery: a user agent should be si,t o1 o2 o3 · · · on < URI1 > s1,1 s1,2 s1,3 · · · s1,n < URI2 > s2,1 s2,2 s2,3 · · · s2,n ... ... ... ... . . . ... < URIm > sm,1 sm,2 sm,3 · · · sm,n Table 1: Change matrix with si,t ∈ {−1, 0, 1, 2}. able to navigate from an entity to associated entities by fol- lowing links, which enables decentralised discovery of new data. source A source B source C source D a e b k ka k c b a w w k b d k d k e k w A B C D E F G H I J K L M N O P Figure 2: Abstraction of a subgraph of the LOD Web. As such, Linked Data forms a labelled directed graph as depicted abstractly in Figure 2. We can see that sub- graphs are contained in sources (a source is a container of a sub-graph and also has a unique URI, in analogy to Web documents), where resources and sources are interlinked. For the remainder of this paper, we focus on how to anal- yse the dynamics of the nodes and edges in such a graph. For further simplification, we do not distinguish between changing sources and entities – interested readers are re- ferred to [Umbrich et al., 2010]. We use the following def- inition of a change of a node: Definition 2. 
(Node state change si,t) A change of the state of a node identified by URI Ui is detected iff the tree of depth 1 with root node Ui differs between two observations ot−1, ot. We denote a change with si,t = 2, si,t = 0 other- wise. In addition, if a node appears in t we denote this with si,t = 1, if it disappears we use si,t = −1. The results of monitoring these changes over time are represented in an m×n change matrix (m resources and n observations) as illustrated in Table 1. 2.2 Data Mining Traditionally, change frequencies of Web documents are modeled as Poisson processes [Cho and Garcia-Molina, 2000] and advanced estimators are used to predict the like- lihood of the next change [Cho and Garcia-Molina, 2003]. Our current findings uncover that this model holds only for some resources of the LOD Web [Umbrich et al., 2010]. In addition, we strongly believe that by applying more ad- vanced machine learning methods we can perform better predictions of changes. By applying these techniques, we expect to reveal the existance of correlations between the change characteristics of different resources. As such, we will investigate correlation analysis, frequency analysis and techniques for change detection. Concrete techniques that we plan to assess are, among others, SVD and SVM, DFT and wavelet transformation as well as change point detec- tion as known from data streams and graphs. 160 3 Describing Change Features and Clustering A first step towards our general objectives is to identify nodes with similar dynamics. Thus, we first investigated how to cluster nodes wrt. to their characteristic dynamics. The main questions we have to answer for that is: What are features describing the dynamics of a resource? Possible describing features are: • Average change frequency ratio: The dynamics of each node can be simply described by the number of node state changes divided by the overall number of snapshots in the monitored period. • Statistical summaries of change behaviour: We could use statistical summaries of node state changes, such as central tendencies (arithmetic mean, median or in- terquartile mean) or statistical dispersion (standard de- viation, variance or quantiles). • Periodicities of changes: such periodicities can be de- termined by DFT or wavelets transformation, which would overcome obvious issues in using an average change frequency (e.g., one entity changing very of- ten in the beginning of the monitored time but then being rather static, compared to another entity chang- ing regularly in larger intervals over all the time). • Eigenvalues or principal components: Using reduc- tion techniques like SVD or PCA we can try to extract significant features capturing the characteristics of the dynamics. Once we decided how to represent dataset dynamics on the basis of the dynamics of nodes we can apply cluster- ing algorithms to group nodes with similar dynamics. Pre- vious experiments indicated the existance of significantly different clusters [Umbrich et al., 2010]. Afterwards, we will use the nodes from the gained clusters to analyse cor- relations among them and among single nodes and their edges. Methods we have in mind for that are classifica- tion approaches and correlation analysis, such as comput- ing the correlation and covariance matrices between URI attributes; e.g., the correlation between node state changes and/or the type of incoming links. 
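A minimal sketch, assuming a toy change matrix in the format of Table 1 and numpy as tooling (the URIs and observation values below are invented for illustration only), of how the average change frequency ratio and a pairwise Pearson correlation between node change series could be computed:

```python
import numpy as np

# Hypothetical change matrix: rows = monitored URIs, columns = observations,
# entries from {-1, 0, 1, 2} as in Definition 2 / Table 1.
S = np.array([
    [0, 2, 0, 2, 0, 2],   # URI_1: changes every second observation
    [0, 0, 2, 0, 0, 2],   # URI_2: changes less frequently
    [1, 2, 2, 2, 2, -1],  # URI_3: appears, changes often, then disappears
])

# Average change frequency ratio: number of state changes (value 2)
# divided by the number of observations in the monitored period.
change_ratio = (S == 2).mean(axis=1)

# Binary change series per URI, used for the correlation analysis.
C = (S == 2).astype(float)

# Pearson correlation matrix between the change series of all URIs;
# entry (i, j) indicates how strongly URIs i and j tend to change together.
corr = np.corrcoef(C)

print("change ratios:", change_ratio)
print("pairwise change correlation:\n", np.round(corr, 2))
```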
We will look into differ- ent methods to compute the correlation coefficients; e.g., the Pearson or Pearman’s correlation coefficient or entropy- based mutual information/total correlation methods that al- low us to detect even more general dependencies. These represent first steps to answer the aforementioned questions about concrete correlations. 4 Preliminary experiments One question we already investigate in this work (see Sec- tion 4.3) is: How many clusters can we identify? Or: What is the “optimal” number of clusters? This is particularly relevant as we first decided to apply k-Means clustering, which requires an a-priori definition of k. If the number of clusters is too small, the clusters contain too many items which are not very similar. If it is too large, we do not capture the actual similarities. With this we want to over- come the limitations of using “soft” categories, e.g., using pre-defined classes like static, low dynamic, medium dy- namic, very dynamic. This might be sufficient for some applications, but not for the particular approaches we are working on. For instance, we want to be able to decide ac- curately when a crawler has to revisit a site to update an index or how to set a sort of cache coherence time in or- der to achieve satisfying data freshness. The materials and methods used and applied to get these first preliminary re- sults are described next. 4.1 Data As the data for our experiments we use a 1% random sam- ple from a data set containing 11 weekly observations of 161K LOD sources crawled using the LDSpider frame- work 3. The average observations size is 440MB gzipped and a total number of 2.7M nodes and roughly 7 million links over the whole monitored time. We used the random sample to achieve manageable processing times. For analysing the sample, we decided to use the data- mining framework RapidMiner4. RapidMiner offers a wide variety of data-mining tools, ranging from basic data clean- ing to complex transformations and analyses. 4.2 Methods For detecting node state changes between two observa- tions ot−1, ot we used a straightforward approach based on a merge-sort scan. After sorting all relevant statements by their syntactic natural order (subject-predicate-object), we performed a pairwise comparison of the statements by scanning two observations in linear time. We record a state change as soon as the order of the statements differed be- tween two observations (e.g., a data producer adds or re- moves outgoing edges and nodes). The used sample consists of 26961 URIs and 10 obser- vations (M=26961 × 10 matrix). In a pre-processing, step we transformed the rectangularly shaped input data into smooth series of values to finally compute the frequency spectrum with a Fourier analysis. The concrete steps in- volved are: 1. Organise by change events: We split the input change matrix into four matrices in the first trans- formation process. One matrix Mev = [mevi,t]; i = 1, 2, . . . ,m; t = 1, 2, . . . , n for each change event ev [ev = -1 (disappear), 0 (no change),1 (appear), 2 (change)]. The matrix values are encoded as follows: mevi,j = { 1 if si,t = ev 0 otherwise (1) 2. Interpolation & Smoothing We interpolated the data values for each matrix by tenfolding each single value and applying an exponential smoothing over the new time series. The interpolation is necessary since we have only 10 observations for each URI. Further, with the interpolation and smoothing we implicitly model the uncertainty of the event. 
We only know that a change event occurred between two observations, but not exactly when. This results in four matrices with 100 attributes for each URI (each M^ev_intpol is a 26961 × 100 matrix). 3. Fourier Analysis: After the interpolation and smoothing of the data we performed a Fourier analysis, which resulted in 32 spectra for each URI (each M^ev_fourier is a 26961 × 32 matrix). 4. Join of Matrices by their URI: Finally, we joined the four matrices by their URIs, which resulted in our final matrix (M_final is a 26961 × 128 matrix). 3http://code.google.com/p/ldspider/ 4http://rapid-i.com/content/view/181/190/

4.3 Preliminary Results
Figure 3 shows the results of analysing the number of clusters for the k-Means clustering with a centroid-distance evaluation measure. The high number of different clusters came as a surprise to us. Clearly, a "soft" category approach with only a handful of clusters, as sketched above, cannot capture these similarities satisfyingly. Moreover, after a brief manual inspection of the obtained clusters, we can intuitively reason about basic correlations, such as nodes from the same domains often being found in the same clusters. This underlines the appropriateness of the applied methodology and motivates further work in this direction towards actual correlation analysis. The obtained clustering results provide a useful basis for these upcoming tasks. We also tried to perform an agglomerative clustering. Unfortunately, we could not obtain any results with it due to performance issues. Given the rather small sample we used, this highlights the need for particularly scalable methods in the context of the huge LOD graph.

Figure 3: Number of clusters (k = 2–20) and average within-centroid distance.

5 Conclusion & Future Work
In this work, we presented our motivation and first ideas for studying the dataset dynamics of the LOD Web using machine learning and data-mining approaches. We strongly believe that we can uncover comprehensive details about attributes and features that trigger or relate changes of LOD resources. The gained knowledge can eventually be integrated into a wide range of Web-related applications. In a first attempt, we identified how we can model the time series of changes and analyse its spectrum. Preliminary clustering results indicate the appropriateness of this approach and provide interesting first insights. We hope that this initial report triggers a rich discussion about suitable methods and promising approaches in the community.

There exist many directions for future work. Clearly, we need a longer history of changes, and thus we will continue our monitoring approach over the next years. Further, we will enrich our change matrix with more information. Possible information includes the incoming labelled links of the nodes, the node state changes of the one-hop surrounding nodes, and more features about the change event (type and fraction of change). Our future investigations include the following directions:

Change correlations We will investigate sophisticated methods to identify correlations between the dynamics of nodes. Especially, data reduction techniques such as PCA or SVD are of high interest to us, and we will explore how and to what extent we can apply them.

Change classification Another area of high relevance for future work is to classify URIs into classes of dynamics.
We are particularly interested to find the best features for the classification and we consider to use the outcome of the correlation analysis to increase the classification quality. Change prediction Knowing in advance at which time in the future a resource is very likely to change is of tremendous value for our use case. Thus, we will ex- plore machine-learning methods to compute the most likely change time based on observed change events. The quality of our predictions can also serve as an evaluation measure of the described approach. Dealing with incomplete history Eventually, we will ex- plore methods to deal with incomplete change histories. It is hard to monitor a resource constantly over a long time period. In a real world setup, a system will never be able to get a continuous history of change events of a resource. For instance, the window intervals might be too large and mul- tiple changes can happen between to snapshots. In such a case it is of high importance to be able to handle these missing events and still predict and classify accurately. Acknowledgements This work based upon works jointly supported by the Sci- ence Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and under Grant No. 08/SRC/I1407 (Clique: Graph & Network Analysis Cluster) and by the European Community 7th framework ICT-2007.4.4 (No 231519) ”e- Lico: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science”. References [Berners-Lee, 2006] Tim Berners-Lee. Linked data, July 2006. http://www.w3.org/DesignIssues/ LinkedData. [Cho and Garcia-Molina, 2000] Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implica- tions for an incremental crawler. In VLDB, pages 200– 209, 2000. [Cho and Garcia-Molina, 2003] Junghoo Cho and Hector Garcia-Molina. Estimating frequency of change. ACM Trans. Internet Techn., 3(3):256–290, 2003. [Harth et al., 2010] Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, and Ju¨rgen Umbrich. Data summaries for on-demand queries over linked data. In World Wide Web Conference (WWW ’10), pages 411–420, 2010. [Haslhofer and Popitsch, 2009] B. Haslhofer and N. Popitsch. DSNnotify - detecting and fixing broken links in linked data sets. In DEXA ’09 Workshop on Web Semantics (WebS ’09), 2009. [Manola and Miller, 2004] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation, February 2004. http://www.w3.org/TR/rdf-primer/. [Pandey et al., 2003] Sandeep Pandey, Krithi Ramam- ritham, and Soumen Chakrabarti. Monitoring the dy- namic web to respond to continuous queries. In World Wide Web Conference (WWW ’03), pages 659–668, 2003. [Umbrich et al., 2010] Ju¨rgen Umbrich, Michael Hausen- blas, Aidan Hogan, Axel Polleres, and Stefan Decker. Towards dataset dynamics: Change frequency of linked open data sources. In WWW ’10 Workshop on Linked Data on the Web (LDOW ’10), 2010. 162 Counting-based Output Prediction for Orphan Screening Katrin Ullrich1,2, Christoph Stahr1,3, Thomas Gärtner1,4 1Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin 2B-IT Research School, 3University of Magdeburg, 4University of Bonn katrin.ullrich@iais.fraunhofer.de, christoph.stahr@student.uni-magdeburg.de, thomas.gaertner@iais.fraunhofer.de Abstract We investigate orphan screening, the search for small molecule ligands of proteins for which no binding ligands are known in advance. Pre- dicting interactions between biologically active molecules is an important step towards effec- tive drug discovery. 
We propose novel classifi- cation and ranking algorithms for orphan screen- ing which are based on counting feature combi- nations in molecular fingerprints. For the train- ing process we only use positive examples and additional knowledge about the considered pro- teins and ligands. This knowledge is available in form of protein similarity values a database of molecule compounds. Our algorithms have run- time linear in the number of unlabelled examples. 1 Introduction We aim at the prediction of ligands for orphan targets. Ligands denote small molecules binding proteins, whereas an orphan target is a protein or protein binding site for which no example ligands are available for learning. More generally, virtual screening denotes the in silico testing of huge databases of molecules for predefined properties, such as the activity against a target. Virtual screening is an important tool to support laboratory testing for the search of ligands. In particular, it is used to preselect promising molecules from the typically very large databases of synthesizable molecules in order to reduce the time and money needed to develop a novel drug. In recent years, machine learning techniques have been very successfully adopted for virtual screening. Whether ligands for the protein under consideration are available beforehand or not, distinguishes the cases of practical relevance. For traditional virtual screening some ligands of the protein are given and others are sought. For orphan screening no ligands of the protein are known and the task is to find some. In the latter case, the protein is called an orphan target. Virtual screening can be modelled as a classification problem. In contrast to approaches where one is inter- ested in the intensity of the protein ligand bond, in the classification setting we only want to find out whether a small molecule binds a protein or not. Approaches like the one described by Geppert et al. (2008) use support vector machines for non-orphan screening. They employ known ligands as positive training instances and a sample of database molecules as negative training instances. Building on these techniques, Jacob and Vert (2008) try to improve predictive accuracy with a hypothesis that targets several proteins at the same time. Additionally, they carry their virtual screening method over to orphan screening. In particular, they classify protein-ligand pairs as follows: a pair is considered as positive training instance if the ligand binds to the protein and as negative one, otherwise. Later, a somewhat different technique for both orphan and non-orphan screening was developed by Geppert et al. (2009). There, the authors realize their prediction via a combination of multiple hypotheses. While Geppert et al. (2009) show that their approach is more efficient, both approaches for orphan screening still suffer from two drawbacks: On the one hand, the database molecules, which in fact are molecules for which the label is not known, are assumed to be negative training instances. On the other hand, the amount of available database molecules is huge but the time complexity of these algorithms is cubic in the number of used database molecules. In this paper we develop a novel approach to virtual screening that is (i) applicable to orphan screening, (ii) does not assume database molecules to be negative instances but models them as unlabelled instances, and (iii) has time complexity linear in the number of database molecules. 
Our approach is derived from the counting- based structured output prediction algorithm proposed in Gärtner and Vembu (2009). Their method works efficiently if sums over features and pairs of features can be computed in linear time. The term structured output refers to the fact that the results of structured output algorithms are more complex than just labels of natural or real numbers. In our case the structured output will be a fingerprint, a feature vector which represents a compound in a database of molecules. Our aim is to adapt the method of Gärtner and Vembu (2009) for the problem of ligand search, as counting features in fingerprints is computationally easy and may lead to good results in this setting. Actually, we achieve the linear runtime complexity by precomputing the sums of feature vectors and their inner products over the database in linear time. 2 Preliminaries Small molecules can be represented as bit vectors of zeros and ones in Rd via molecular fingerprints. The developers of these fingerprints assembled a set of d features in order to display the molecules by means of their chemical properties. Fingerprints have already been established as 163 standard tools in ligand prediction tasks. Kernel methods are a popular class of learning algo- rithms. They include techniques like the support vector machine that are well-founded in learning theory and have been applied in many real-world problems. They work by applying a symmetric and positive semidefinite function kX : X × X → R as a similarity measure for elements of a space X . A function with these properties is also called kernel function and is known to uniquely generate a reproducing kernel Hilbert space HX of real-valued functions defined on X (see Schölkopf et al. (2001)). In recent years, kernel methods have been extended to handle more complex – so-called structured – outputs. For this purpose we want to learn a joint scoring function f which is defined on the cross-product of two spaces X and Y . The function f should assign a real score to pairs (x, y) ∈ X × Y according to "how well x and y fit". Applied to our situation, we want to find ligands for proteins from a set X by screening a huge set of small molecules stored in a database Y . A common choice, which we will also adopt in this work, is to choose the tensor product of the input and output reproducing kernel Hilbert spaces H = HX ⊗ HY as hypothesis space for the joint scoring function. The function space H has the kernel k = kX ⊗ kY given by k ( (x, y), (x′, y′) ) = kX (x, x′) · kY(y, y′), for (x, y), (x′, y′) ∈ X × Y , where kX and kY are the kernels of HX and HY , respectively. We will assume that the feature mapping φ : Y → Rd is low dimensional and given explicitly. This is indeed naturally the case for vir- tual and orphan screening where we use fingerprint vectors of length d as molecule representations. In practice, the positive semidefinite kernel kX is calculated essentially as a DNA sequence identity for two proteins and ranges from 0 to 100. Furthermore, we consider the inner product of feature vectors as the kernel kY in the sense of kY(y, y′) = 〈φ(y), φ(y′)〉, for y, y′ ∈ Y . 3 The Method In the sequel we propose new loss-functions adapted to the problem of ligand prediction with a database of unlabelled examples and few positive training instances. Concerning the solution of the respective optimization problem we will benefit from the properties of reproducing kernel Hilbert spaces. 
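As a small illustration of the product kernel k = kX · kY introduced above, the following sketch assumes that protein similarities are available as a lookup table of sequence-identity scores and that molecules are given as 0/1 fingerprint vectors; all names and values are hypothetical:

```python
import numpy as np

def k_protein(x1, x2, similarity_table):
    """Protein kernel kX: here a precomputed sequence-identity score in [0, 100]."""
    return similarity_table[(x1, x2)]

def k_fingerprint(y1, y2):
    """Fingerprint kernel kY: inner product of d-dimensional 0/1 fingerprints."""
    return float(np.dot(y1, y2))

def k_joint(x1, y1, x2, y2, similarity_table):
    """Joint kernel on (protein, molecule) pairs: k = kX * kY (tensor product)."""
    return k_protein(x1, x2, similarity_table) * k_fingerprint(y1, y2)

# Toy example with two proteins and two short fingerprints (d = 8).
sim = {("P1", "P1"): 100.0, ("P1", "P2"): 37.5,
       ("P2", "P1"): 37.5, ("P2", "P2"): 100.0}
y_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_b = np.array([1, 1, 0, 1, 0, 0, 1, 1])

print(k_joint("P1", y_a, "P2", y_b, sim))   # kX(P1, P2) * <y_a, y_b>
```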
In fact, our minimizer will turn out to be a linear combination of kernel functions evaluated at the training examples. In addition to orphan screening, our algorithm can also be applied to virtual screening for non-orphan tar- gets, i.e., the search for new ligands of proteins with a few known ligands. The main algorithmic difference be- tween both tasks is the utilization of positive training exam- ples. For orphan screening we exploit ligand information of other proteins and a similarity measure between those pro- teins (used for training) and the orphan target. Contrary to orphan screening, for virtual screening of non-orphan tar- gets we may additionally use the information we already have for the target in consideration. 3.1 Counting-Based Structured Output Prediction We are considering a set of proteins X and a database of small molecules Y , where the latter contains the potential ligands. In general, we are looking for a joint scoring func- tion f : X × Y → R, which assigns a real value to every pair (x, y) ∈ X × Y . This scoring value should be the higher the better ligand y binds to protein x. The hypothe- sis space for f shall be the reproducing kernel Hilbert space H = HX⊗HY with product kernel k = kX⊗kY . Suppose we have positive training examples (x1,Y1), . . . , (xm,Ym) ∈ X × 2Y , i.e., a set Yi of known ligands for every protein xi ∈ {x1, . . . , xm} = Xtrain ⊆ X . Moreover, we assume the availability of further knowledge about the candidate space Y . For our algorithm this knowledge consists of a database containing the elements of Y in form of feature vectors. Note, that we do not consider the exponentially large set of all possible molecule representations. With unlabelled ex- amples we denote every possible combination (xi, z) such that xi ∈ Xtrain and z ∈ Zi, with Zi := Y \ Yi. These pairs are unlabelled examples as we do not know before- hand whether z binds to xi or not. For orphan screening the considered orphan target x˜ is an element of X \Xtrain, while a non-orphan target is contained in Xtrain by con- struction of the two different tasks. Finally, our goal is a sorted list of the compounds in Y with respect to a certain protein x˜. Therefore, we want to assign high values of the scoring function f to positive examples and small values to the probably negative unlabelled examples, respectively. We will present different loss-functions which fulfill this requirement (and depend on both positive and unlabelled examples) in the next section. In principle, we want to learn a function f∗ ∈ H which minimizes the regularized risk functional f∗ = argmin f∈H λ‖f‖2H + m∑ i=1 R(f, i) (1) where R(·, i) is the loss-function for the i-th positive training example. Since the quadratic function is strictly monotonically increasing and the regularized risk term∑m i=1R(f, i) specified below is a function with inputs (xi, y) and (xi, z) such that xi ∈ Xtrain, y ∈ Yi, and z ∈ Zi, i = 1, . . . ,m, we may apply the representer theo- rem (Schölkopf et al., 2001). We obtain for our optimizer f∗ that f∗ ∈ F = span { kX (·, x) · kY(·, z) : x ∈ X , z ∈ Y } = span { kX (·, x) · 〈φ(·), φ(z)〉 : x ∈ X , z ∈ Y } , where kX (·, ·) is the adequate protein similarity value, and 〈φ(·), φ(·)〉 is the inner product in the output space with dimension d. Usually, Y is given as a huge database of in- stances. For this reason we would like to scale down F . In practice the dimension d of the feature space is often much smaller than the number of elements in Y (e.g. 
fin- gerprint maccs has d ≈ 160, while |Y| is usually greater than 100.000). If we consider the canonical orthonormal 164 basis e1, . . . , ed of Rd, the transformation F = span { kX (·, x) · 〈φ(·), φ(z)〉 : x ∈ Xtrain, z ∈ Y } = span { kX (·, x) · 〈φ(·), ∑d l=1 βzl el〉 : x ∈ Xtrain, z ∈ Y, βzl ∈ R } = span {∑d l=1 βzl kX (·, x) · 〈φ(·), el〉 : x ∈ Xtrain, z ∈ Y, βzl ∈ R } shows that f∗(x, y) = m∑ i=1 d∑ l=1 αl,ikX (x, xi)〈φ(y), el〉 (2) for appropriate coefficients αl,i ∈ R. Now we have to op- timize f∗ over only m · d coefficients (instead of m · |Y| coefficients as in the initial representation of F). We will find the solution f∗ of (1) by minimizing with respect to α ∈ Rd×m. Following, we propose new loss-functions for ligand prediction which finally determines the optimization problem (1) completely. 3.2 Loss-Functions for Orphan Screening As mentioned above, we want to learn a function f that maps positive examples to higher values than unlabelled ones. For the associated optimization problem we present different loss-functions and contrast them with each other. The loss-functions are all based on quadratic loss (compare Gärtner and Vembu (2009)). They vary in the treatment of labelled and unlabelled examples, respectively. Suppose we have a set Xtrain = {x1, . . . , xm} of proteins and a database of molecules Y . Furthermore, for every xi ∈ X we have a set Yi of already verified ligands and the complement set of unlabelled database molecules Zi with respect to xi. We will refer to (xi,Yi) as the i-th training unit. We propose to minimize the regularized risk functional (1) with regularization constant λ > 0 and training units 1, . . . ,m, for the following loss-functionsR1, . . . ,R4 (ranking simple) R1(f, i) = ∑ y∈Yi ∑ z∈Zi [1− fi(y) + fi(z)]2 (classification simple) R2(f, i) = ∑ y∈Yi [1− fi(y)]2 + 1|Zi| ∑ z∈Zi [1 + fi(z)] 2 (ranking mean) R3(f, i) = ∑ y∈Yi [ 1− fi(y) + 1|Zi| ∑ z∈Zi fi(z) ]2 (classification mean) R4(f, i) = ∑ y∈Yi [1− fi(y)]2 + [ 1 + 1|Zi| ∑ z∈Zi fi(z) ]2 , where fi(y) := f(xi, y). The loss-functions R1, . . . ,R4 are categorized according to their treatment of unlabelled examples (simple/mean) and to their arrangement of examples via classification or ranking (classification/ranking). The functions R2 and R4 push the positive and unlabelled examples for every protein xi ∈ X to scores of +1 or −1, respectively. In contrast, R1 and R3 rank ligands in an independent fashion for every training example xi ∈ Xtrain. Another qualitative differentiating feature is that R1 and R2 are constructed such that every single unlabelled example is ranked or classified. In contrast, R3 and R4 handle unlabelled examples as a unit. Only the mean of their values of f is supposed to be lower than the values of f for positive examples. For our data setting we would expect the classification mean approach to work best. We believe this because, on the one hand, we have classification-type data. On the other hand, it should be more appropriate to pull the mean of the values of the scoring function for unlabelled examples down. Actually, the unlabelled examples contain ligands which, by construction, are supposed to have high values of f . An interesting property of the optimization problems above is that they are equivalent to using the second order Taylor approximation of the exponential loss. For example, consider minλ‖f‖2 + m∑ i=1 ∑ y∈Yi ∑ z∈Zi [1− fi(y) + fi(z)]2 and minλ‖f‖2 + m∑ i=1 ∑ y∈Yi ∑ z∈Zi exp[1− fi(y) + fi(z)]. 
Both optimization problems are equivalent except for a re- moval of multiplicative or additive constants resulting in a shift of the objective function. In the sequel we will sketch how to obtain compact formu- las for all functions necessary for optimization. 3.3 Objective Function and Algorithm In order to solve the optimization problem in (1), according to the parameterization given by (2), we have to minimize with respect to α ∈ Rd×m. Let us define the terms Y ∈ Rm×d, C ∈ Rd×d, and Φ ∈ Rd by Yi· := ∑ y∈Yi φT (y) , C := ∑ z∈Y φ(z)φ(z)T , Φ := ∑ z∈Y φ(z), the kernel matrix of protein similarities, [K]mi,j=1 := kX (xi, xj), and the vectors of constants V := (|Y1|, . . . , |Ym|)T , W := (|Y| − 2|Y1|, . . . , |Y| − 2|Ym|)T , S := ( |Y1| |Zi|2 , . . . , |Ym| |Zm|2 )T , S := ( 1 |Zi|2 , . . . , 1 |Zm|2 )T , U := ( |Y1| |Zi| , . . . , |Ym| |Zm| )T , U := ( 1 |Zi| , . . . , 1 |Zm| )T . As the definitions of C and Φ include counts of fingerprint features or feature combinations in linear time, it becomes clear why our method is called counting-based. These constants can be computed in linear time by a single pass 165 over the database. With this constants we obtain a compact formulation (independent of the size of the database) of the objective functions o(α,R1), . . . , o(α,R4) with the respective loss-functionsR1, . . . ,R4 o(α,R1) = λ tr(αKαT )− |Y| tr(Y αK) + V TKαTΦ + 1 2 m∑ i=1 Wi(αK·i)TCYi(αK·i) + 1 2 V T diag(KαTCαK) − ΦTαK diag(Y αK) + ‖diag(Y αK)‖2 o(α,R2) = λ tr(αKαT )− (1 + U)T diag(Y αK) + U T KαTΦ+ 1 2 U T diag(KαTCαK) + 1 2 m∑ i=1 (1− U)i(αK·i)TCYi(αK·i) o(α,R3) = λ tr(αKαT )− (1+ U)T diag(Y αK) + UTKαTΦ+ 1 2 m∑ i=1 (αK·i)TCYi(αK·i) + 1 2 ΦTαK(S ◦KαTΦ) − ΦTαK((S + U) ◦ diag(Y αK)) + ∥∥∥∥∥ (√ 1 2 S + U ) ◦ diag(Y αK) ∥∥∥∥∥ 2 o(α,R4) = λ tr(αKαT )− (1+ U)T diag(Y αK) + U T KαTΦ+ 1 2 m∑ i=1 (αK·i)TCYi(αK·i) + 1 2 ΦTαK(S ◦KαTΦ)} − ΦTαK(S ◦ diag(Y αK)) + ∥∥∥∥∥ (√ 1 2 S ) ◦ diag(Y αK) ∥∥∥∥∥ 2 . With standard techniques we also obtained the gradient of the objective function and the product of the Hessian with a vector v ∈ Rd×m which we need for optimization. Due to the positive semidefinite kernel matrix K and the particular structure of the objective function, we deal with a convex optimization problem. Hence, we can use the Newton conjugate gradient method to obtain a solution for our minimization problems. Our ligand prediction algorithm works as follows for a given target protein x: After learning the scoring func- tion f we calculate the values f(x, z) for every database molecule z ∈ Zx (note, orphan screening and virtual screening for non-orphan targets only vary in the data used for training). Afterwards, all those molecules are sorted with respect to their score. The performance of the algorithm can be measured by recovery rates, i.e., the number of actual ligands among the first s sorted molecules (compare also Geppert et al., 2009), where s is an appropriate threshold. 4 Conclusion and Future Work In this paper we developed several counting-based struc- tured output prediction algorithms for orphan screening as well as traditional virtual screening. For that we established four loss-functions evaluating the quality of a scoring function f . The algorithms aim at an assignment of scores to small molecules in a database according to how good they bind a target protein, in particular, an orphan target. 
The loss-functions are constructed such that: After a sorting of the database with respect to the scoring values, ligands of the target protein should be in the top positions of the sorted list of molecules. Furthermore, the design of the loss-functions shows our intention to model the database molecules as unlabelled examples. We will measure the accuracy of our approach in real-world virtual as well as orphan screening settings. Concretely, we want to test our four methods on a set of proteins, each with a set of known ligands. Furthermore, we plan to vary the algorithm. For example, we want to include actual negative examples, insert an additional variance term for the neutral examples, or invert input and output space. Moreover, an important part of our practical studies will be the comparison of our algorithm with SVMStruct and baseline orphan screening methods. In addition to the classification setting, we want to test regression approaches. Acknowledgments This work was supported in part by the German state NRW within the B-IT Research School and by the German Re- search Foundation (DFG) under the reference number ‘GA 1615/1-1’. References D. Erhan, P.-J. L’Heureux, S. Y. Yue, and Y. Bengio. Collaborative filtering on a family of biological targets. Journal of Chemical Informatics and Modelling, 46: 626–635, 2005. T. Gärtner and S. Vembu. On structured output training: hard cases and an efficient alternative. Machine Learn- ing Journal (Special Issue of ECML PKDD), 76(2):227– 242, 2009. H. Geppert, T. Horváth, T. Gärtner, S. Wrobel, and J. Ba- jorath. Support-vector-machine-based ranking signifi- cantly the effectiveness of similarity searching using 2d fingerprints and multiple reference compounds. Journal of Chemical Informatics and Modelling, 48:742–746, 2008. H. Geppert, J. Humrich, D. Stumpfe, T. Gärtner, and J. Bajorath. Ligand prediction from protein sequence and small molecule information using support vector ma- chines and fingerprint descriptors. Journal of Chemical Informatics and Modelling, 49:767–797, 2009. L. Jacob and J.-P. Vert. Protein-ligand interaction predic- tion: an improved chemogenomics approach. Bioinfor- matics, 24(19):2149–2156, 2008. B. Schölkopf, R. Herbrich, and A. J. Smola. A Generalized Representer Theorem. In Computational Learning The- ory, volume 2111 of Lecture Notes in Computer Science, pages 416–426. Springer, Berlin, 2001. 166 Workshop Information Retrieval WIR 2010 Ingo Frommholz University of Glasgow Scotland, UK Claus-Peter Klas Fernuniversita¨t Hagen Germany The IR Workshop The ubiquity of search systems has led to the application of information retrieval technology in many new contexts (e.g. mobile and international) and for new object types (products, patents, music). To develop appropriate prod- ucts, basic knowledge on information retrieval needs to be revisited and innovative approaches need to be applied, for example by allowing for user interaction or by taking the user’s situational context into account. The quality of in- formation retrieval needs to be evaluated for each context. Large evaluation initiatives respond to these challenges and develop new benchmarks. The workshop Information Retrieval 2010 of the Special Interest Group on Information Retrieval within the German Gesellschaft fu¨r Informatik (GI) provides a forum for sci- entific discussion and the exchange of ideas. 
The work- shop takes place in the context of the LWA ”Learning, Knowledge and Adaptivity” workshop week (LWA, Oct 4-6, 2010) at the University of Kassel in Germany. This workshop continues a successful series of conferences and workshops of the Special Interest Group on Information Retrieval. The workshop addresses researchers and prac- titioners from industry and universities. Especially Doctor- ate and Master students are encouraged to participate. WIR 2010 The following types of submissions were requested this year: • Full Papers (8 accepted submissions) • Short Papers (2 accepted submissions): Position pa- pers or work in progress • Poster and Demonstrations (1 accepted submissions): Poster and presentation of ideas, systems or proto- types The following areas were covered by the workshop: • Development and optimization of retrieval systems • Retrieval with structured and multimedia documents • Evaluation and evaluation research • Text mining and information extraction • Multilingual systems • Digital libraries • User interfaces and user behavior • Interactive IR • Combination of structured and unstructured search • Machine learning in information retrieval • Information retrieval and knowledge management • Information retrieval and the semantic web • Search Engine Optimization • Social Search Program committee The program committee had the following members (in al- phabetical order): • Prof. Dr. Maximilian Eibl, TU Chemnitz • Prof. Dr. Reginald Ferber, Hochschule Darmstadt • Dr. Ingo Frommholz, University of Glasgow, UK (Chair) • Prof. Dr. Norbert Fuhr, Universita¨t Duisburg-Essen • Rene´ Hackl, FIZ Karlsruhe • Prof. Dr. Andreas Henrich, Universita¨t Bamberg • Frank Hopfgartner, University of Glasgow, UK • Dr. Claus-Peter Klas, Fernuniversita¨t Hagen (Chair) • Dr. Michael Kluck, Stiftung Wissenschaft und Politik, Berlin • Dr. Sascha Kriewel, Universita¨t Duisburg-Essen • Dr. Johannes Leveling, Dublin City University, Ire- land • Dr. Thomas Mandl, Universita¨t Hildesheim • Dr. Wolfgang Mu¨ller, EML Research • Dr. Gu¨nter Neumann, DFKI • Prof. Dr. Vivien Petras, Humboldt-Universita¨t, Berlin • Prof. Dr. Marc Rittberger, DIPF, Frankfurt am Main • Dr. Thomas Roelleke, Queen Mary University of Lon- don, UK • Dr. Ralf Schenkel, Universita¨t des Saarlandes, Saarbru¨cken • Prof. Dr. Ingo Schmitt, TU Cottbus • Prof. Dr. Benno Stein, Universita¨t Weimar • Dr. Ulrich Thiel, Fraunhofer IPSI, Darmstadt • Prof. Dr. Gerhard Weikum, Max-Planck-Institut fu¨r Informatik • Prof. Dr. Judith Winter, Fachhochschule Frankfurt am Main • Prof. Dr. Christian Wolff, Universita¨t Regensburg • Prof. Dr. Christa Womser-Hacker, Universita¨t Hildesheim 167 We would like to thank the authors for their submissions, and we also thank the members of the program committee for providing helpful constructive reviews. Kassel, October 2010, Ingo Frommholz and Claus-Peter Klas 168 Language Models, Smoothing, and IDF Weighting Najeeb Abdulmutalib, Norbert Fuhr University of Duisburg-Essen, Germany {najeeb|fuhr}@is.inf.uni-due.de Abstract In this paper, we investigate the relationship be- tween smoothing in language models and idf weights. Language models regard the relative within-document-frequency and the relative col- lection frequency; idf weights are very similar to the latter, but yield higher weights for rare terms. Regarding the correlation between the language model parameters and relevance for two test col- lections, we find that the idf type of weighting seems to be more appropriate. 
Based on the observed correlation, we devise empirical smoothing as a new type of term weighting for language models, and retrieval experiments confirm the general applicability of our method. Finally, we show that the most appropriate form of describing the relationship between the language model parameters and relevance seems to be a product form, which confirms a language model proposed before.

1 Introduction
For several years, language models have been the preferred type of IR models [Hiemstra, 1998; Ponte and Croft, 1998; Berger and Lafferty, 1999]. In contrast to other models, they explicitly include a document indexing model that relates the within-document frequency of a term to its indexing weight. On the other hand, there is no explicit notion of probability of relevance. Closely related to this statement, there is the somewhat unclear relation between tf*idf weighting (as used e.g. in the classic vector space model or in BM25) and the probabilistic parameters of language models. In this paper, we present some empirical results that relate language models to tf*idf weights, which leads us to a new smoothing method giving us good retrieval results.

2 Language Models
Language models regard a text in the form of a sequence of words as a stochastic process. Thus, for a given vocabulary (set of terms) T, a language model θ is defined as a probability distribution θ = {(t_i, P(t_i|θ)) | t_i ∈ T} with Σ_{t_i ∈ T} P(t_i|θ) = 1. In the most simple form, one assumes independence of term occurrences, and thus the probability of a document text d = t_1 t_2 t_3 ... t_m w.r.t. the language model θ can be computed as P(d|θ) = Π_{j=1}^{m} P(t_j|θ).

The basic idea for defining a retrieval function is to compare the language model of the document d to that of the query q. One way of doing this is to compute the probability that the query was generated by the document's language model:

P(q|d) ≈ Π_{t_i ∈ q^T} P(t_i|d) = Π_{t_i ∈ q^T ∩ d^T} P_s(t_i|d) · Π_{t_i ∈ q^T − d^T} P_u(t_i|d) = Π_{t_i ∈ q^T ∩ d^T} [P_s(t_i|d) / P_u(t_i|d)] · Π_{t_i ∈ q^T} P_u(t_i|d)   (1)

Here d^T denotes the set of terms occurring in the document, and q^T refers to the set of query terms. P_s(t_i|d) denotes the probability that the document is about t_i, given that t_i occurs (is seen) in the document. On the other hand, P_u(t_i|d) denotes the same probability for those terms t_i not occurring (unseen) in the document.

The estimation of these parameters suffers from the problem of sparse data, and thus a number of smoothing methods have been developed. Let F denote the total number of tokens in the collection, cf(t) the collection frequency of term t, l(d) the number of tokens in document d, and tf(t, d) the corresponding within-document frequency of term t. Then we estimate

P_avg(t) = cf(t) / F   and   P_ml(t|d) = tf(t, d) / l(d),

where P_avg(t) is the average relative frequency of t in the collection, and P_ml(t|d) is the maximum likelihood estimate for the probability of observing t at an arbitrary position in d.

Various smoothing methods have been developed in the past. In this paper, we only regard the most popular one, namely Jelinek-Mercer (JM) smoothing [Jelinek and Mercer, 1980]:

P_s(t_i|d) = (1 − λ) P_ml(t_i|d) + λ P_avg(t_i)   (2)

Here λ (with 0 ≤ λ ≤ 1) is a global smoothing parameter that allows for collection-specific tuning.
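A minimal sketch of JM smoothing as defined in eqn (2), using a hypothetical toy collection; P_avg and P_ml are estimated exactly as above, and λ = 0.7 is the value that worked best for JM smoothing in our experiments:

```python
from collections import Counter

# Hypothetical toy collection of tokenised documents.
docs = {
    "d1": ["language", "models", "smoothing", "models"],
    "d2": ["idf", "weighting", "language"],
    "d3": ["smoothing", "methods", "for", "language", "models"],
}

# Collection statistics: cf(t) and F (total number of tokens).
cf = Counter(t for tokens in docs.values() for t in tokens)
F = sum(cf.values())

def p_avg(t):
    """Relative collection frequency P_avg(t) = cf(t) / F."""
    return cf[t] / F

def p_ml(t, tokens):
    """Maximum likelihood estimate P_ml(t|d) = tf(t, d) / l(d)."""
    return tokens.count(t) / len(tokens)

def p_jm(t, tokens, lam=0.7):
    """Jelinek-Mercer smoothing, eqn (2): (1 - lambda) * P_ml + lambda * P_avg."""
    return (1.0 - lam) * p_ml(t, tokens) + lam * p_avg(t)

print(p_jm("models", docs["d1"]))      # term seen in d1
print(p_jm("weighting", docs["d1"]))   # term unseen in d1, falls back to P_avg
```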
For the unseen terms, [Zhai and Lafferty, 2001] propose the following es- timate: Pu(ti|d) = αdPavg(t) with αd = 1−∑ti∈qT∩dT Pavg(t) 1−∑ti∈qT∩dT Pml(t|d) 169 As we can see from eqn 2, the term weight is a weighted sum of its relative within-document frequency Pml and its relative frequency in the whole collection, Pavg . Classic tf*idf weighting formulas are based on the same parame- ters. However, whereas the tf part of these types of weights usually is some monotonic transformation of Pml, the idf part is the negative logarithm of the document frequency, i.e. similar to − log(Pavg). (Note, however, that Pavg refers to tokens, whereas the idf weight regards the number of documents in which a term occurs. A theoretic treatment of this aspect can be found in [Roelleke and Wang, 2008]. Here we assume that this difference is negligible.) The the- oretic justification of idf weights goes back to [Croft and Harper, 1979], who showed that the relative document fre- quency is an estimate for the probability of the term occur- ring in a nonrelevant document, thus linking this parameter to relevance. Thus, we have the language model interpre- tation of Pavg on one hand, and the relevance-oriented in- terpretation of the idf weight on the other hand. In the for- mer, the weight of a term grows monotonically with Pavg, whereas the opposite is true for idf weights. Although [Roelleke and Wang, 2008] shows how the two kinds of weightings relate to each other, this paper does not resolve the apparent contradiction. 3 Language model parameters vs. probability of relevance As an alternative to a theoretic treatment, we performed an empirical study on the distribution of the language model parameters Pml and Pavg in relevant and nonrelevant doc- uments. For that, we regarded two test collections: 1. The INEX 2005 collection consisting of 16819 journal articles (764 MB), where we regard each of the 21.6 million XML elements as a document1. Due to this document definition, we have a great variation in doc- ument lengths, as is illustrated in figure 1. As query set, we use the corresponding 29 content-only queries. 2. The AP part of the TREC collection containing 240,000 documents, along with TREC queries 51-100 and 101-150. For computing our statistics for the given query sets , we considered all query-document pairs where the document 1Retrieval of XML elements can be used as a first step in a two-stage process for focused XML retrieval, where the second step picks the most specific elements from each XML document that answer the query in the most exhaustive way. Figure 1: Distribution of document lengths in INEX contains at least one query term. In case the document is relevant, all query terms are regarded as being relevant for this document otherwise all terms are irrelevant. Now we aim at relating the (Pml, Pavg) pairs of terms to their prob- ability of relevance P (R|t) that a document containing t will be judged relevant to a random query containing t as query term. For that, we perform an appropriate binning of (Pml, Pavg) pairs into two-dimensional intervals, and then we compute the ratio of relevant pairs among all pairs in an interval. Figure 2 shows the corresponding statistics for the INEX collection2. At first glance, we already see that this statis- tics confirms the tf*idf heuristics: the higher Pml and the smaller Pavg , the higher P (R|t). 
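The binning and ratio computation described above can be sketched as follows; the data here is synthetic (drawn so that it merely mimics the observed trend of higher P(R|t) for high P_ml and low P_avg), and the bin boundaries are illustrative assumptions rather than the ones used for Figures 2–3:

```python
import numpy as np

# Synthetic (P_ml, P_avg) pairs with a relevance flag per query-term/document pair.
rng = np.random.default_rng(0)
p_ml = rng.uniform(0.0, 0.2, 1000)
p_avg = 10.0 ** rng.uniform(-6, -2, 1000)
prob_rel = np.minimum(1.0, 1e-3 * p_ml / p_avg)   # higher P_ml, lower P_avg -> more relevant
relevant = rng.random(1000) < prob_rel

# Two-dimensional binning: equal-width intervals for P_ml, logarithmic for P_avg.
ml_edges = np.linspace(0.0, 0.2, 6)
avg_edges = np.logspace(-6, -2, 6)
ml_bin = np.digitize(p_ml, ml_edges) - 1
avg_bin = np.digitize(p_avg, avg_edges) - 1

# P(R|t) per interval: ratio of relevant pairs among all pairs in the interval.
for i in range(len(ml_edges) - 1):
    for j in range(len(avg_edges) - 1):
        mask = (ml_bin == i) & (avg_bin == j)
        if mask.any():
            print(f"P_ml bin {i}, P_avg bin {j}: "
                  f"{int(mask.sum())} pairs, P(R|t) = {relevant[mask].mean():.3f}")
```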
Moreover, Pavg is dom- inating and Pml has only a minor effect: For any given Pavg interval, the P (R|t) values are roughly all in the same order of magnitude (ignoring the case where Pml = 0), whereas for any Pml interval the P (R|t) values vary by several orders of magnitude. This observation contrasts with the standard justification of smoothing methods in lan- guage models, where it is said that Pml is the dominating factor and Pavg is used only for dealing with data spar- sity. The results also show that for Pml = 0 (terms not occurring in the document), P (R|t) is much smaller than for Pml > 0. For higher values of Pavg , P (R|t) seems to be zero. However, using a logarithmic scale, we can see that P (R|t) decreases monotonically when Pavg increases. The corresponding results for the TREC collection are shown in figure 3. The major difference to the TREC col- lection is that in TREC, the slope in the Pavg direction is not as high as in INEX. One possible explanation could be the fact that the relevance definition used in TREC is less strict than the INEX one. Furthermore, for terms not occurring in the document, there is only a minor P (R|t) difference in comparison to those having low Pml values. Overall, these empirical observations confirm the domi- nant role of Pavg wrt. retrieval quality. This is in stark con- trast to the standard language model justification, saying that Pml is more important and Pavg only helps in smooth- ing. 4 Implementing empirical smoothing Based on the observations described above, we now want to propose a new approach for smoothing, which we call em- pirical smoothing. The basic idea is already illustrated in figures 2–3: For each possible combination of (Pml, Pavg) values of a term, these plots show the corresponding prob- ability P (R|t). So it seems straightforward to use these values as result of the smoothing process. In principle, there are three different ways for imple- menting this idea: Direct use of interval values: As outlined above, we can directly use the probability estimates of P (R|t) from figures 2-3. Thus, given a (Pml, Pavg) pair, we de- termine the corresponding 2-dimensional interval, and then look up its P (R|t) value from the training set. However, this method needs large amounts of training data to avoid overfitting. Moreover, it does not give us any insights into the relationship between (Pml, Pavg) and P (R|t). Application of probabilistic classification methods: This approach has been investigated already in [Fuhr 2In order to derive a meaningful statistics, elements with less than 100 words were not considered here. 170 Figure 2: P (R|t) for different (Pml, Pavg) values (INEX), linear/log scale Figure 3: P (R|t) for different (Pml, Pavg) values (TREC) and Buckley, 1991]. As input, the machine learning method would use the raw data underlying the figures from above, i.e., for each term in each query- document pair considered, we have a training instance consisting of the Pml and Pavg values as features and the relevance decision as class variable. In recent years, this kind of approach has also become very pop- ular for developing retrieval functions in the so called ’learning to rank’ approaches (see e.g. [Fuhr, 1989; Liu, 2009]). Like the previous method, however, this approach operates like a black box, giving us no further insights. Application of numeric prediction: Here we start with the data shown in figures 2 - 3, and now seek for a function that describes the relationship between Pml, Pavg and P (R|t). 
As classic smoothing functions perform the same kind of task, we can compare the outcome of the machine learning method with these functions. From these three possibilities, we only consider the last one in the following. Furthermore, we only regard the most simple variant of numeric prediction, namely linear regres- sion. 5 Linear regression First, we use a purely linear function of the form: Ps(ti|d) = αPml + βPavg + γ (3) As a second variant, we start from the observation in fig- ure 2 that a linear function of Pavg may not be very appro- priate. Therefore we use log(Pavg) instead: Ps(ti|d) = αPml + β log(Pavg) + γ (4) Table 1 shows the actual coefficients which have been predicted using linear regression, along with the average squared error. As we can see, replacing Pavg by its loga- rithm (LR linear vs. LR log) reduces the error substantially for both collections. For further analysis, we regard the difference between the linear predictions of equation 3 and the actual P (R|t) values, as illustrated in figure 4 for the INEX collection (for TREC, the figure looks very similar). In the ideal case, there would be random errors; instead, these figures show 171 Table 1: Coefficients derived using linear regression Method Collection α β δ γ Error LR linear INEX 0.97 -60.43 0.12 0.053 LR log INEX -9.12 -2 9.7 0.011 LR quadratic INEX 0.97 -209.58 41064.69 0.18 0.022 LR linear cnst.=0 INEX 2.59 -23.4 0 0.060 LR linear TREC 1.07 -6.93 0.13 0.091 LR log TREC -6.23 -0.5 3.43 0.012 LR quadratic TREC 1.07 -28.03 660.81 0.16 0.041 LR linear cnst.=0 TREC 2.65 -2.69 0 0.094 Figure 4: Residuals for linear regression (INEX) us systematic deviations from the predicted values. The distribution of these errors suggests that a quadratic func- tion of Pavg would be more appropriate: Ps(ti|d) = αPml + βPavg + δP 2avg + γ (5) Looking at the corresponding quadratic errors in table 1 (LR quadratic), we see that the quadratic form is better than the linear one, and rates as good as the variant with log(Pavg). Since JM smoothing also uses a linear function with Pml and Pavg as inputs, we want to compare its outcome with that of our linear regression. For that, we used the equa- tion 2 with λ = 0.7 which gave the best results for this method. For better comparison, we also tried a variant of the regression function 3, where we dropped the constant γ, so in this case it has the same structure as the JM smoothing function. However, looking at the corresponding regression coefficients listed in table 1 (LR linear cnst=0), we see that Pavg has a negative coefficient, whereas JM smoothing as- sumes both coefficients to be positive. In JM smoothing, Pml is the dominating factor (although Pavg has a higher weight with λ = 0.7, it is at least an order of magnitude smaller than Pml), whereas the empirical data as well as the result of our regression put major emphasis on Pavg , and Pml just serves as a minor correction factor. 6 Retrieval experiments Finally, we performed retrieval experiments with the re- trieval function 1 and the various regression functions and compared them with standard retrieval functions. The re- sults are depicted in table 2 and figure 5. For the three variants of linear regression, we did not separate between training and test sample, so their results are a bit optimistic. Only for the purely linear form, we per- formed experiments with 2-fold cross validation (LR linear (cv)), showing that the choice of the training sample has little effect on the quality of results. 
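For reference, a minimal sketch of fitting the three regression variants of equations (3)–(5) by ordinary least squares; the data is synthetic and only mimics the qualitative trend of Figures 2–3, so the resulting coefficients do not reproduce Table 1:

```python
import numpy as np

# Synthetic training data: P_ml, P_avg and a noisy P(R|t)-like target.
rng = np.random.default_rng(1)
p_ml = rng.uniform(0.0, 0.2, 500)
p_avg = 10.0 ** rng.uniform(-6, -3, 500)
p_rel = np.clip(0.9 * p_ml - 40.0 * p_avg + 0.1 + rng.normal(0, 0.01, 500), 0.0, 1.0)

def fit(design):
    """Least-squares fit of P(R|t) against the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, p_rel, rcond=None)
    pred = design @ coef
    return coef, np.mean((pred - p_rel) ** 2)

ones = np.ones_like(p_ml)
# Eqn (3): alpha * P_ml + beta * P_avg + gamma
lin = np.column_stack([p_ml, p_avg, ones])
# Eqn (4): alpha * P_ml + beta * log(P_avg) + gamma
log = np.column_stack([p_ml, np.log(p_avg), ones])
# Eqn (5): alpha * P_ml + beta * P_avg + delta * P_avg^2 + gamma
quad = np.column_stack([p_ml, p_avg, p_avg ** 2, ones])

for name, design in [("linear", lin), ("log", log), ("quadratic", quad)]:
    coef, err = fit(design)
    print(name, np.round(coef, 3), "avg. squared error:", round(err, 5))
```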
Comparing the results of the three variants of linear re- gression, we can see that for both collections, already the linear form gives good results, which can be improved by using one of the variants. For INEX, log(Pavg) gives the best quality overall, whereas the quadratic form yields im- provements for the top ranking elements only. With TREC, both the logarithmic and the quadratic form are much bet- ter than the linear one. In both cases, the quality of JM smoothing is comparable to that of the linear form. BM25 performs poorly for INEX, but very good for TREC. Furthermore, we also present results for our odds-like language model presented in [Abdulmutalib and Fuhr, 2008], where the retrieval function is shown in eqn. (6); as estimate of P (d)/P (d¯), we use the ratio of the length of d and the average document length, and ω and γ are tuning parameters for smoothing. ρo,e(q, d) = ∏ ti∈qT∩dT ( Pml(ti|d) Pavg(ti|C) )ω · ∏ ti∈qT−dT Pavg(ti|C)γ · P (d) P (d¯) (6) 172 Table 2: Retrieval results: empirical smoothing vs. standard retrieval methods (INEX / TREC) Method MAP P@5 P@10 P@20 LR linear 0.0729 0.355 0.339 0.334 LR log 0.1004 0.397 0.366 0.315 LR quadratic 0.0668 0.389 0.389 0.359 JM 0.0667 0.303 0.245 0.216 LR linear (cv) 0.0862 0.331 0.324 0.299 Odds 0.0800 0.348 0.348 0.323 ZL 0.0780 0.338 0.324 0.307 BM25 0.0063 0.096 0.087 0.070 Method MAP P@5 P@10 P@20 LR linear 0.0286 0.283 0.253 0.213 LR log 0.0633 0.359 0.312 0.273 LR quadratic 0.0654 0.304 0.247 0.222 JM 0.0307 0.214 0.238 0.231 LR linear (cv) 0.0355 0.345 0.339 0.333 Odds 0.0572 0.232 0.211 0.191 ZL 0.0611 0.279 0.233 0.228 BM25 0.0844 0.445 0.432 0.352 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 0.2 0.4 0.6 0.8 1 Pr ec isi on Recall LR linear LR log LR quadratic JM BM25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.2 0.4 0.6 0.8 1 Pr ec isi on Recall LR linear LR log LR quadratic JM BM25 Figure 5: Recall-precision graphs for various smoothing methods and BM25 (INEX / TREC) -1.5 -2.0 -2.5 -3.0 -10 -3.5 -10.5 -9 -4.0 -10.0 log(Pm l) -9.5 -4.5 -8 -9.0 -5.0 -7 -8.5 -5.5-8.0 -6 log(P(R) -7.5log(Pavg) -6.0 -5 -7.0 -6.5 -6.5 -4 -3 -2 -1 -0 -9 log(P(R) -8 -7 -16 -6 -5 -4 -3 -2 -1 -0 -14 -12 -10log(Pavg) -8 -6 -4 -1.5 -2.0 -2.5 -3.0 -3.5-4.0 -4.5 -2 -5.0 -5.5 -6.0 log(Pm l) Figure 6: log(P (R|t)) as function of log(Pml) and log(Pavg)) (INEX / TREC) 173 Overall, these results show that empirical smoothing in combination with nonlinear regression functions is superior to classic smoothing methods. 7 Further analysis Figures 2–3 illustrating the relationship between Pml, Pavg and P (R|t) do not plot the first two dimensions in propor- tion to their size. In contrast, figure 6 uses a logarithmic scale for all three dimensions and also plots them propor- tionally. These figures indicate that the relationship could be fairly well described by a plane in the log-log-log space. In fact, looking at the odds-like-retrieval function (6), this is exactly the form that would result from such a plane (modulo the document length component). Based on this function, we also performed a few experiments with logis- tic regression, but the results were inferior to that of a grid search for the parameters ω and γ [Abdulmutalib, 2010, sec. 7.8]. 8 Conclusion and Outlook In this paper, we have investigated the relationship between smoothing in language models and idf weights. Although the relative collection frequency Pavg and idf weights are very similar, there is a contradiction in the weighting strat- egy. 
Regarding the correlation between the language model parameters and relevance, we find that the idf type of weighting seems to be more appropriate. Based on the observed correlation, we have devised empirical smooth- ing as a new type of term weighting for language models, and retrieval experiments confirm the general applicability of our method. Finally, we showed that the most appro- priate form of describing the relationship between the lan- guage model parameters and relevance seems to be a prod- uct form, which confirms a language model proposed by us before. In this paper, we have not considered the influence of document length. In fact, other smoothing methods like Dirichlet smoothing or absolute discount (see e.g. [Lafferty and Zhai, 2001]) consider this parameter. Thus, empirical smoothing could also be extended to document length as third parameter. The comparison between theoretic models and empirical data in this paper has brought us interesting observations. However, this comparison does not answer the question why JM smoothing gives fairly reasonable retrieval results, its structure contradicts our empirical findings. A reason- able explanation for this effect remains the subject of fur- ther research. References [Abdulmutalib and Fuhr, 2008] Najeeb Abdulmutalib and Norbert Fuhr. Language models and smoothing methods for collections with large variation in document length. In A M. Tjoa and R. R. Wagner, editors, DEXA Work- shops, pages 9–14. IEEE Computer Society, 2008. [Abdulmutalib, 2010] Najeeb Abdulmutalib. Language models and smoothing methods for information re- trieval. PhD thesis (submitted), 2010. [Berger and Lafferty, 1999] Adam Berger and John Laf- ferty. Information retrieval as statistical translation. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pages 222–229, New York, 1999. ACM. [Croft and Harper, 1979] W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval with- out relevance information. Journal of Documentation, 35:285–295, 1979. [Fuhr and Buckley, 1991] Norbert Fuhr and Chris Buck- ley. A Probabilistic Learning Approach for Document Indexing. ACM Trancactions on Information Systems, 9(3):223–248, 1991. [Fuhr, 1989] N. Fuhr. Optimum polynomial retrieval func- tions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183–204, 1989. [Hiemstra, 1998] Djoerd Hiemstra. A linguistically mo- tivated probabilistic model of information retrieval. In Lecture Notes In Computer Science - Research and Ad- vanced Technology for Digital Libraries - Proceedings of the second European Conference on Research and Advanced Technology for Digital Libraries: ECDL’98, pages 569–584. Springer Verlag, 1998. [Jelinek and Mercer, 1980] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, 1980. [Lafferty and Zhai, 2001] J. Lafferty and C. Zhai. Prob- abilistic ir models based on document and query gen- eration. In B. Croft J. Callan and J. Lafferty, editors, Proceedings of workshop on Language Modeling and Information Retrieval, 2001. [Liu, 2009] Tie-Yan Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331, 2009. [Ponte and Croft, 1998] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. 
In SIGIR'98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281, New York, NY, USA, 1998. ACM Press.

[Roelleke and Wang, 2008] Thomas Roelleke and Jun Wang. TF-IDF uncovered: a study of theories and probabilities. In SIGIR, pages 435–442, 2008.

[Zhai and Lafferty, 2001] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In W. B. Croft, D. Harper, D. H. Kraft, and J. Zobel, editors, Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval, New York, 2001. ACM.

An Attribute-based Model for Semantic Retrieval

Hany Azzam and Thomas Roelleke
School of Electronic Engineering & Computer Science
Queen Mary University of London
London, UK
{hany,thor}@dcs.qmul.ac.uk

Abstract

This paper introduces a knowledge-oriented approach for modelling semantic search. The modelling approach represents both semantic and textual data in one unifying framework, referred to as the probabilistic object-relational content modelling framework. The framework facilitates the transformation of "term-only" retrieval models into "semantic-aware" retrieval models that consist of semantic propositions, such as relationships and classification of objects. To illustrate this facility, an attribute-based retrieval model, referred to as TF-IEF-AF-IDF, is instantiated using the modelling framework. The effectiveness of the developed retrieval model is demonstrated using the Internet Movie Database test collection. Overall, the probabilistic object-relational content model can guide how semantic search and semantic data are modelled.

1 Introduction

Modern retrieval systems have become more complex and semantic-aware by exploiting more than just the text, e.g. [Bast et al., 2007; Kasneci et al., 2008]. Nowadays, large-scale knowledge bases can be generated automatically and relatively easily from knowledge sources such as Wikipedia or other semantically explicit data repositories such as ontologies and taxonomies that explain entities (e.g. mark-up of persons, movies and locations) and record relationships (e.g. bornIn and actedIn).

Obstacles arise, however, when developing ranking functions and, in a broader sense, search strategies that combine query and document text with other types of evidence derived from semantic-rich knowledge bases. In particular, it is challenging when the ranking function is implemented directly on top of a standard physical document representation, as in the standard information retrieval (IR) engineering approach [Cornacchia and de Vries, 2007]. Consequently, an alternative approach, or more ambitiously, an alternative standard, is required to reduce the complexity of building and maintaining information systems and re-using their retrieval strategies [Fuhr, 1999; Hiemstra and Mihajlovic, 2010].

Design re-use is particularly important since IR already has a well-established family of retrieval models, namely TF-IDF, BM25 and language modelling (LM), that are used in many tasks but in slightly different ways. Ideally, these standard retrieval models should be re-used and adapted to solve complex and semantic retrieval tasks, and overall, to maximise the benefit gained from the underlying data. How to link the world of retrieval models with the world of semantic data, while avoiding an extensive engineering process, is not straightforward.
However, the first step towards transferring the achievements of text retrieval models, and so maximising the impact of semantic data, will be to simplify the process of tailoring search to a specific work task [Hawking, 2004].

This paper revisits a framework that may help to establish a standard for developing semantic retrieval models. The framework, referred to as the probabilistic object-relational content model (PORCM, [Roelleke, 1999]), is closely linked to engineering initiatives espoused in the development of business automation solutions using relational database management systems: design a conceptual schema; express the user application in terms of this schema; and design the user interface.

Such a framework, thus, gives creative freedom to designers to invent and refine semantic-aware retrieval models. It also allows for more than one semantic-based retrieval model and avoids the need to propose a single retrieval model for semantic search. As such, the framework provides a platform for developing effective semantic retrieval models applicable to textual and semantic data.

1.1 Contributions & Structure

The main contribution of this paper is to use a modelling framework to demonstrate a semantic variant of a standard retrieval model. The framework acts as a logical layer, which decouples the retrieval models from the physical representation of the data (document structure and content), bringing what the database field calls "data independence" to IR systems.

A BM25-motivated semantic retrieval model is instantiated using the probabilistic object-relational content model. This particular model exemplifies how, by taking a knowledge-oriented approach, retrieval models which are traditionally designed for terms, i.e. for keyword-based retrieval, can exploit terms and semantic evidence while ensuring data independence.

The feasibility of developing workable retrieval models for semantic search and the effectiveness of the developed retrieval model are demonstrated on the Internet Movie Database (IMDB) collection.

The remainder of this paper is organised as follows. Section 2 highlights some similarities and differences between structured document retrieval and semantic retrieval. The section also outlines related work in the literature. Section 3 details the probabilistic ORCM and its components. Section 4 showcases a retrieval model for semantic retrieval based on the proposed framework. Section 5 evaluates the retrieval model, and Section 6 concludes the discussion.

Queries: SDR – keyword-oriented, with structural components (e.g. XPath with "contains" predicate); SR – knowledge-based, with keyword-based components.
Retrieval Unit: SDR – Documents, Sections, etc. (structural objects); SR – Documents, Actors, etc. (any object).
Evidence Spaces: SDR – Terms and Element Types (e.g. section, title); SR – Terms, Class names, Classifications, Relationship names, Relationships, Attribute names, Attributes.
Figure 1: Structured Document Retrieval (SDR) versus Semantic Retrieval (SR)

2 Background and Related Work

2.1 Structured Document Retrieval & Semantic Retrieval

Figure 1 illustrates some characteristics of structured document retrieval (SDR) and semantic retrieval.

SDR is not limited to a particular structural markup language or a particular forum and so has general applicability. Our discussion, however, focuses on XML as the structural markup language because it is the most widely used standard.
Additionally, we focus on INEX-related work and contributions as INEX is characterised by research about SDR, and it is the initiative for the evaluation of XML re- trieval [Fuhr et al., 2002]. The first point of comparison is the type/formulation of queries. In SDR the user information need can be formulated using either text-only or text-and-structure ap- proaches. The text-only approach is keyword-based (e.g., the content-only approach in INEX [Amer-Yahia and Lal- mas, 2006]). The text-and-structure approach combines textual and structural clues (e.g. XPath, NEXI [Trotman and Sigurbjo¨rnsson, 2004]). Queries in semantic retrieval can be expressed using the aforementioned approaches in addition to the semantic structures found in the formulated query and/or the document representation (e.g. semantic content-and-structure approach [van Zwol and van Loos- broek, 2007]). Semantic queries can also be expressed us- ing graph-patterns such as SPARQL [Prud’hommeaux and Seaborne, 2006]. Extending SPARQL for full-text and se- mantic search (e.g. [Bast et al., 2007; Kasneci et al., 2008; Elbassuoni et al., 2009]) is analogous to the prior work on SDR which has enhanced XPath and XQuery by various forms of text-search and ranking capabilities. The graph- based approaches, in contrast to XML trees, are indepen- dent of the physical representation of the underlying data. If a query is expressed via XQuery, the application would have to know the particular data representation. The second aspect illustrated in Figure 1 is the retrieval unit (answer type). In SDR the document structure is ex- plicit and, therefore, the retrieval unit is based on the doc- ument’s presentation and logical structures, such as chap- ter and section. Semantic retrieval focuses instead on re- trieving objects which have a particular meaning. This is particularly evident in the discussion of query formulations when semantic text-and-structure topics are introduced to conduct semantic retrieval [van Zwol and van Loosbroek, 2007]. Therefore, in semantic retrieval, the answer type has a more general and semantic form, which includes objects such as a person, a product, or a project. The third difference is the evidence (ranking criteria) used in retrieval. Unlike in traditional IR, in SDR the ad- ditional structural evidence not seen in unstructured (flat) text documents is exploited. For example, in XML re- trieval the logical document structure is used to estimate the relevance of an element according to the evidence as- sociated with this element only. When the goal is to rank documents, the probabilities of estimating the rele- vance of each element are combined to produce a single probability for the document. There are two strands of models for element-based ranking and evidence combina- tion: variants of probabilistic models based on the prob- ability ranking principle, such as [Robertson et al., 2004; Lu et al., 2005]; and variants of the statistical language modelling technique proposed by [Ponte and Croft, 1998], such as [Ogilvie and Callan, 2002]. In semantic retrieval, structural elements can be also utilised to estimate the relevance of an object. However, these elements bear a semantic meaning (e.g. actor, direc- tor Figure 2) and, thus, a distinctive term distribution. For example, [Kim et al., 2009] extend [Ogilvie and Callan, 2002] to demonstrate how varying weights for different se- mantic elements across query terms can improve retrieval performance. 
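As a rough illustration of the evidence-combination idea described above – element-level relevance estimates being merged into a single document score, possibly with different weights per element type – the following sketch weights each element score by its element type. The linear combination, the weights and the function name are illustrative assumptions only; the cited models ([Robertson et al., 2004; Ogilvie and Callan, 2002; Kim et al., 2009]) define their own, more principled combination and estimation schemes.

    def document_score(element_scores, type_weights, default_weight=0.1):
        """Combine element-level scores into one document score by weighting
        each score with a weight for its element type (e.g. 'title', 'actor')."""
        return sum(type_weights.get(element_type, default_weight) * score
                   for element_type, score in element_scores)

    # Hypothetical movie document with scores from three of its elements:
    elements = [("title", 0.8), ("plot", 0.3), ("actor", 0.6)]
    weights = {"title": 0.5, "actor": 0.3, "plot": 0.2}
    print(document_score(elements, weights))   # 0.8*0.5 + 0.3*0.2 + 0.6*0.3 = 0.64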
Evidence associated with semantic annotations in the form of linguistic structures is also used in semantic retrieval, especially in question answering applications. [Zhao and Callan, 2008; Bilotti et al., 2007] propose ranking answer-bearing sentences to questions by incorporating the semantic annotations in both the sentences and queries into the retrieval process.

Semantic annotations expressed in the form of Resource Description Framework (RDF) graphs, referred to as entity-relationship graphs, are also used as ranking criteria. These graphs are used as a source of evidence to construct graph-based ranking models and queries for retrieving semantic objects. For example, [Kasneci et al., 2008; Elbassuoni et al., 2009] propose LM variants for ranking the results of keyword-augmented graph-pattern queries over entity-relationship graphs.

[Figure 2: XML-based Representation of a Movie – an XML record for the movie "Gladiator" (2000) with a plot element ("Maximus is a powerful Roman general ..."), an actor element (Russell Crowe) and a director element (Ridley Scott); the mark-up itself is not reproduced here.]

In this paper we utilise full-text and semantic evidence for semantic retrieval. These types of evidence are represented by a set of propositions: terms, classifications, relationships and attributes. For example, in Figure 2 the XML-based representation of a movie contains several types of evidence. If explicated according to the aforementioned set of propositions, then the term proposition would represent the full-text evidence, which is similar to the common keyword-based IR representation. The classification proposition would capture the "class of" relationship between objects and classes (e.g. director "Ridley Scott"). The relationship proposition would associate two objects, and the attribute proposition would contain the relationship between an object and an atomic value.

The aforementioned propositions stem from object-oriented and content modelling. They support the retrieval of structural elements, semantic elements and heterogeneous objects. The information about the structure and the specifics of objects is represented in the unifying framework of text (terms) and object-oriented modelling (Section 3 discusses the framework in more detail).

2.2 Related Work

Related work can be found primarily in investigations of the XML retrieval task, which has been addressed from both IR and database perspectives. The proposed semantic retrieval model is akin to XML retrieval models such as BM25f [Robertson et al., 2004] and hierarchical language modelling [Ogilvie and Callan, 2003] in that it is based on combining different evidence spaces and probabilities. Our retrieval model is also similar to the models in [Kasneci et al., 2008; Elbassuoni et al., 2009] in the sense that these models focus on retrieving semantic data. However, these approaches mainly propose a variant of only one traditional retrieval model to answer semantic queries; moreover, they do not combine other sources of evidence such as full-text and/or structural elements. On the systems side, approaches such as ESTER [Bast et al., 2007] support a similar class of semantic queries as the proposed model; however, efficiency is their primary design goal.

Our knowledge representation approach shares some aspects with logical approaches for modelling IR such as MIRTL (Multimedia Information Retrieval Terminological Logic, [Meghini et al., 1993]), where "terms" are used to represent concepts and roles. In our framework, however, content is considered separate from the concepts of object-oriented modelling.
To implement the retrieval model designed using the proposed knowledge representation, an integrated database and IR approach is used. This approach is similar to probabilistic database approaches found in [Dalvi and Suciu, 2004; Chaudhuri et al., 2006; Roelleke et al., 2008].

3 Knowledge Representation

This section proposes a schema component for semantic retrieval. The notion of a "schema" highlights the difference between keyword-based and semantic retrieval: the former requires a search over only an inverted file structure, while the latter requires several processing steps and different representations. The proposed schema is based on object-relational modelling principles. Traditionally, the object-relational model (e.g. [Stonebraker et al., 1998]) uses relations such as "memberOf(...)", "relationship(...)" and "attribute(...)" to model concepts such as classification, relationships, and attributes. We extend this approach and introduce an object-relational content model. The model integrates object-relational modelling and content-oriented (term-based) modelling into one framework. Consequently, we extend the model to create its probabilistic variant, namely the probabilistic object-relational content model. This model includes relations which represent probabilistic parameters to model IR-like retrieval models.

3.1 Object Relational Content Model (ORCM)

Figure 3 uses the ORCM to represent the movie in Figure 2. Ellipses indicate that some data have been omitted to conserve space. The locations where different elements occur are stored as paths, expressed in XPath. For readability we use a simplified syntax, e.g., "imdb/movie[1]/title[1]" points to the attribute describing a movie's title.

(a) Term propositions in the element and root contexts

term(Term, Context):
  gladiator  329191/title[1]
  2000       329191/year[1]
  russell    329191/.../actor[1]
  crowe      329191/.../actor[1]
  ridley     329191/.../director[1]
  scott      329191/.../director[1]
  maximus    329191/plot[1]
  powerful   329191/plot[1]
  roman      329191/plot[1]
  general    329191/plot[1]
  ...

term_doc(Term, Context):
  gladiator  329191
  2000       329191
  russell    329191
  crowe      329191
  ridley     329191
  scott      329191
  maximus    329191
  powerful   329191
  roman      329191
  general    329191
  ...

(b) Classification propositions

classification(ClassName, Object, Context):
  movie     329191                  movies[1]
  title     329191/title[1]         329191
  year      329191/year[1]          329191
  actors    329191/actors[1]        329191
  actor     329191/.../actor[1]     329191/actors[1]
  team      329191/team[1]          329191
  director  329191/.../director[1]  329191/team[1]
  plot      329191/plot[1]          329191
  ...

(c) Attribute propositions

attribute(AttrName, Object, Value, Context):
  id  329191                  "329191"     movies[1]
  id  329191/.../actor[1]     "russel..."  32.../actors[1]
  id  329191/.../director[1]  "ridley..."  32.../team[1]
  ...

Figure 3: An Object-Relational Content Model Representing a Movie

Traditionally, in order to model the task of document retrieval, a term-document representation based on a data structure such as "term(Term, DocId)" would suffice. For example, in Figure 2 the terms in the XML fragment can have a flat representation (the XML elements are not interpreted), such as "term(movie, 329171)". Consequently, for document retrieval the retrieval models (e.g. TF-IDF, language modelling, BM25) for ranking documents are primarily based on a "term(Term, DocId)" relation.

In the case of SDR, a structural representation, such as "term(Term, SecId)", is necessary. This is because the contexts or the document structure (e.g.
abstract, section, paragraph) are explicit. Additionally, the retrieval models are usually based on combining the scores obtained from scoring every context, or on combining the term frequencies. What, then, are the data structures which we use to model (implement) semantic retrieval? We first review the design process of the ORCM [Roelleke, 1999] and then demonstrate how it can be utilised for semantic retrieval.

The ORCM combines object-oriented and content-based modelling concepts. The object-oriented concepts include classification, relationships and attributes, which are more generally referred to as propositions – a specification stemming from object-oriented modelling and terminological logics [Meghini et al., 1993]. Content modelling is analogous to the traditional IR representation of text, which is usually a term-context-based representation. However, unlike conventional modelling approaches for IR, such as terminological logic [Meghini et al., 1993], "content" is viewed as separate from the concepts of object-oriented modelling. This separation helps content to be described in a more formal and knowledge-oriented way.

There are two design steps taken in order to achieve this separation: each predicate within a proposition is associated with a context¹, and a term proposition is used as the keyword-based IR representation for text.

Other propositions (components) that can be taken into account include generalisation and aggregation, where generalisation is a relationship between classes, and aggregation is a particular relationship between objects (entities). However, it is the four aforementioned propositions with which we shall be mainly concerned here. The pillars of the probabilistic object-relational content model can be summarised as follows:

• classification of objects: monadic predicate of the form "ClassName(Object)", for example: "actor(russell crowe)".
• relationship between objects: dyadic predicate of the form "RelationshipName(Subject, Object)", for example: "directedBy(329191, ridley scott)".
• attribute of objects: dyadic predicate of the form "AttributeName(Object, Value)", for example: "ridley scott.name("Ridley Scott")".

In order to implement the object-oriented modelling concepts, a relational approach is used, resulting in the proposed object-relational schema. Relations such as "classification" and "relationship" are devised. Moreover, in order to join object-oriented modelling with keyword-based and content-oriented modelling, an additional predicate, namely "term", is used, and an additional attribute column, namely "Context", is adjoined to term, classification, relationships and attributes. This yields the ORCM modelling paradigm. Below we contrast the conventional object-relational model with the object-relational content model².

Object-relational modelling (ORM):
• classification(ClassName, Object)
• relship(RelshipName, Subject, Object)
• attr(AttrName, Object, Value)

Object-relational content modelling (ORCM):
• classification(ClassName, Object, Context)
• relship(RelshipName, Subject, Object, Context)
• attr(AttrName, Object, Value, Context)
• term(Term, Context)

¹ Context is a general concept that refers to documents, sections, databases or any other object with content.
² In the schema design process we often opt for shorter relation and attribute names so as to achieve a slimmer form of the schema.

Terms are complementary to classification and relationships.
Most importantly, and one of the main emphases of this modelling paradigm, is that content is not modelled, for example, as a relationship "contains(DocumentId, Term)"; rather, content is modelled by maintaining an attribute column "Context" in the schema for each proposition. In other words, content is modelled separately from existing concepts, such as classification and relationships.

There are several fundamental benefits of utilising the object-relational content model. One benefit is that knowledge modelling, in general, aids "knowledge architects" in building information and knowledge management systems that are both flexible and scalable. Another benefit is that it facilitates the transformation of term-document-based IR retrieval models into retrieval models founded on the probabilistic object-relational content model, thus resulting in a strand of retrieval models suited for semantic retrieval. Lastly, the model enables the representation of textual, structural and semantic data in one unifying framework. The uniform representation of the data, the semantic retrieval models and the decoupling between the two using the object-relational content model results in data independence. This is a desirable feature when designing complex retrieval systems.

In summary, the ORCM is to be understood as a conceptual model with a set of relations – a relation for each basic concept of object-oriented and content-based modelling.

3.2 Probabilistic Spaces for Semantic Retrieval

The evidence space (ranking criteria) construction is facilitated by the probabilistic object-relational content model. The probabilistic ORCM comprises the relations of the ORCM, as well as relations representing probabilistic parameters. For example, for the basic relation "term_doc(Term, Doc)", there can be term-based and document-based probabilities. Some of the probabilistic relations for "term_doc" are:

• p_DF_t_term_doc(T): document frequency-based probability of term t derived from relation "term_doc(Term, Doc)"
• p_TF_t_term_doc(T): tuple frequency-based probability of term t derived from relation "term_doc(Term, Doc)"
• p_idf_term_doc(T): inverse document frequency (IDF)-based probability of term t derived from relation "term_doc(Term, Doc)"

In the probabilistic ORCM, the techniques and models of IR devised for term-based retrieval models become available for class-based, relationship-based and attribute-based retrieval. Concepts such as the tuple frequency-based probability of a class (class frequency, CF) and the IDF of a class name (similar to the IDF of a term) make immediate sense. Similarly, the tuple frequency-based probability of an attribute name (attribute name frequency, AF) and the IDF of an attribute name become possible.

The ability to transfer the achievements of term-based retrieval directly to semantic retrieval models makes the probabilistic ORCM a potential base for semantic retrieval. Moreover, the way probabilistic spaces can be combined in the probabilistic ORCM can lead to new and effective retrieval models. For example, the frequencies of attribute names are exploited to define an attribute-based retrieval status value (RSV), and this RSV can be combined with other RSVs, such as the term-based one. The next section provides an example of a retrieval model constructed on top of the discussed representation.
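To make the schema and the derived probabilistic parameters concrete, here is a minimal Python sketch of the ORCM relations and of simple DF- and IDF-style estimates over a term_doc relation, carried over to attribute names. The tuple layout, function names and the plain maximum-likelihood estimates are assumptions for illustration; in the paper these parameters are realised as probabilistic relations inside a probabilistic relational framework (cf. HySpirit in Section 5), which involves a normalisation not shown here.

    from collections import namedtuple
    from math import log

    # Relations of the object-relational content model (Section 3.1);
    # the field layout is an illustrative assumption.
    Term = namedtuple("Term", "term context")
    Classification = namedtuple("Classification", "class_name obj context")
    Relship = namedtuple("Relship", "rel_name subject obj context")
    Attr = namedtuple("Attr", "attr_name obj value context")

    def df_prob(term, term_doc):
        """Document-frequency-based probability of a term over a
        term_doc(Term, Doc) relation: df(term) / number of documents.
        Assumes the term actually occurs in the relation."""
        docs = {d for _, d in term_doc}
        return len({d for t, d in term_doc if t == term}) / len(docs)

    def idf_weight(term, term_doc):
        """A plain -log(df-probability) IDF weight; the probabilistic ORCM
        turns such weights into probabilistic relations."""
        return -log(df_prob(term, term_doc))

    def attr_name_idf(attr_name, attrs):
        """The same idea carried over to attribute names: an IDF weight of
        an attribute name over the contexts of the attr relation."""
        contexts = {a.context for a in attrs}
        df = len({a.context for a in attrs if a.attr_name == attr_name})
        return -log(df / len(contexts))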
4 An Attribute-based Retrieval Model

The proposed model, TF-IEF-AF-IDF, for semantic retrieval focuses on combining evidence from the attribute name and term predicate spaces. The model is implemented in three phases. Figure 4 is a snapshot of the proposed model's components when answering query number 28, "gladiator action maximus scott", from the IMDB test collection. We detail the three phases below.

Phase 1 retrieves "Term-Document's Element-Query" triplets for each query term. Such retrieval can be performed using any term-based retrieval model (e.g. TF-IEF, TF-IDF, BM25, LM). We choose here TF-IEF, where TF is the within-element term frequency, and IEF is the inverse element frequency of a term. TF-IEF is defined as follows:

Definition 1 TF-IEF:

\[ \mathrm{RSV}_{\text{TF-IEF}}(e, q) := \sum_{t \in e \cap q} \mathrm{TF}(t, e) \cdot \mathrm{IEF}(t) \tag{1} \]

"t" stands for term, "e" for element, such as "title", and IEF is the inverse element frequency.

Phase 2 consists of two parts. The first part infers the attribute name and root context (root nodes) from the document's elements in the "Term-Document's Element-Query" triplet. This yields an intermediate and query-dependent attribute-based index (this index corresponds to "tf_ief_match_augmented" in Figure 4).

The second part infers for each query term its top-k corresponding "context type". For example, for a query such as "fight brad pitt" the inferred top-1 context type would be "title" for the query term "fight" and "actor" for the query terms "brad" and "pitt". This is because "fight" occurs in the context of type "title", and "brad" and "pitt" occur in the context of type "actor". The attribute-based index constructed in the first part of Phase 2 is used to infer the mapping between each query term and its type. The result of this inference is represented in Figure 4 ("AttrName" corresponds to the context type in which the query term occurs) as the relation

qTermAttr(Term, AttrName, Query)

The probability of the mapping between a query term and an attribute type is estimated using the number of mappings between a term and an attribute name divided by the total number of mappings in the intermediate index. The intuition behind the mapping is that if a term occurs frequently within a certain context type then the term is more likely to be "characterised" by that particular context [Kim et al., 2009].

Phase 3 combines an attribute-based retrieval score with a traditional topical (term-based) score, resulting in the TF-IEF-AF-IDF model. The motivation to do so is that in some cases an attribute-only retrieval score is deemed unsuitable for estimating the relevance of a particular semantic object with respect to a query. In other words, not all queries are issued with a particular semantic predicate or relationship in mind to which the query terms can be mapped. Therefore, a document-based retrieval score provides a more realistic setting whereby both the attribute-based and document-based retrieval scores are considered. The intuition behind this combination is comparable to the mixture of document-based and element-based language model scores [Zhao and Callan, 2008]. We formally define the model and its components below.

Definition 2 The TF-IEF-AF-IDF Model: The TF-IEF-AF-IDF model is a multi-stage retrieval model. In the first stage, TF-IEF is used to associate with each query term the document elements and the query. Also, each query term is associated with an attribute name. Then, the RSVs of attribute-based retrieval and term-based retrieval are combined into an overall score.
Let d′ be the document inferred from d, where the inference assigns several attribute names (context types) to each "Term-Document's Element-Query" triplet generated in Phase 1. Let q′ be the query inferred from q, where the inference assigns several attribute names (context types) to each query term.

\[ \mathrm{RSV}_{\text{TF-IEF-AF-IDF}}(d, q) := \mathrm{RSV}_{\text{TF-IDF}}(d, q, \text{term-based-index-all-docs}) + \mathrm{RSV}_{\text{AF-IDF}}(d', q', \text{attribute-based-index-retrieved-docs}) \tag{2} \]

The term-based score is defined as follows:

Definition 3 TF-IDF:

\[ \mathrm{RSV}_{\text{TF-IDF}}(d, q) := \sum_{t \in d \cap q} \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t) \tag{3} \]

The attribute-based score is defined as follows:

Definition 4 AF-IDF:

\[ \mathrm{RSV}_{\text{AF-IDF}}(d, q) := \sum_{a \in d \cap q} \mathrm{AF}(a, d) \cdot \mathrm{IDF}(a) \tag{4} \]

AF(a, d) and TF(t, d) correspond to the within-document attribute name frequency and the within-document term frequency components, respectively; "a" is an attribute name, "d" is a document and "q" is the query. The frequencies are estimated using BM25's TF_K(t, d) = tf_d / (tf_d + K_d) quantification [Robertson, 2004]. tf_d is the total frequency and K is a normalisation factor reflecting the document length. In the attribute-based aggregation it is af_d instead of tf_d, and K is the number of attributes in the intermediate index. The IDF in the AF-IDF is calculated over the set of retrieved documents only.

Figure 4 demonstrates for the IMDB's query number 28 the processing steps and relational instantiations of the TF-IEF-AF-IDF model (see Section 5 for details on the IMDB collection). "tf_ief_match" represents the retrieved Term-Document-Element-Query triplets for each query term. "tf_ief_match_augmented" infers attribute names and root contexts for each Term-Document-Element-Query triplet, and "attr_index" is the representation of the attribute names and root contexts. "qTermAttr" represents the query term and its inferred attribute name. Lastly, "AF-IDF" represents the predicate-based attribute retrieval scores and "TF-IEF-AF-IDF" represents the combined predicate-based attribute and term retrieval scores.

4.1 Discussion

The model described above integrates several models into a stepwise retrieval process. Such a retrieval process is similar to "database matching", which is achieved in several steps, using different representations at each database level. The stepwise approach is more complex than simple flat text matching and creates opportunities for more powerful matching specifications for semantic search.

The framework ensures that traditional models such as TF-IDF and aggregation techniques such as BM25's TF_K quantification are transferable to models where semantic knowledge is explicated, such as AF-IDF.

[Figure 4: TF-IEF-AF-IDF Retrieval Phases (IMDB Query 28) – relational instantiations for query q028 ("gladiator action maximus scott"): (a) element-based retrieval results (tf_ief_match), (b) augmented retrieval result with inferred attribute names and root contexts, (c) attribute-based index, (d) IDF of attribute names over the attribute index, (e) query representation with terms and inferred attributes (qTermAttr), (f) attribute-frequency-based score (AF_K-IDF), and (g) the combined TF-IEF-AF-IDF score per movie context.]

In particular, the attribute-based aggregation (the AF_K component) is instrumental to the performance of the TF-IEF-AF-IDF retrieval model, as will be illustrated in the evaluation section. The query terms and the retrieved elements (Phase 2) are mapped to their corresponding semantic predicates, which, in this case, are semantic attribute names. This mapping, then, results in an aggregation over the attribute names instead of the terms.

Our analysis suggests that this shift leads to an event space that contains fewer distinct events (attribute names), each of which occurs frequently. Such a feature is well-suited to the BM25-like aggregation of frequencies because if an event occurs in a context then the probability that it occurs again is greater than the initial probability, i.e. the occurrence of an event depends on previous occurrences – [Wu and Roelleke, 2009] have proposed a probabilistic semantics for this feature, referred to as "semi-subsumed". The non-linear nature of the aggregation is key to the good retrieval quality of TF-IEF-AF-IDF.
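A compact sketch of the score combination defined in eqns. (2)–(4), with the BM25-like TF_K quantification applied to both term and attribute-name frequencies, may help to make the stepwise process concrete. The plain dictionaries and parameter names are assumptions for illustration; in the paper the model is realised as relational programs over the probabilistic ORCM, and the attribute-name IDF is computed over the retrieved documents only, which this sketch leaves to the caller.

    def tf_k(freq, k):
        """BM25-like saturation freq / (freq + K), used for TF and AF alike."""
        return freq / (freq + k) if freq > 0 else 0.0

    def rsv_tf_idf(query_terms, term_freqs, idf, k_d):
        """Term-based score of eqn. (3) with the TF_K quantification."""
        return sum(tf_k(term_freqs.get(t, 0), k_d) * idf.get(t, 0.0)
                   for t in query_terms)

    def rsv_af_idf(query_attr_names, attr_freqs, attr_idf, k_a):
        """Attribute-based score of eqn. (4) over the inferred attribute names."""
        return sum(tf_k(attr_freqs.get(a, 0), k_a) * attr_idf.get(a, 0.0)
                   for a in query_attr_names)

    def rsv_tf_ief_af_idf(query_terms, query_attr_names, term_freqs, attr_freqs,
                          idf, attr_idf, k_d, k_a):
        """Overall score of eqn. (2): sum of the term- and attribute-based RSVs."""
        return (rsv_tf_idf(query_terms, term_freqs, idf, k_d)
                + rsv_af_idf(query_attr_names, attr_freqs, attr_idf, k_a))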
To illustrate the proposed retrieval model we used an XML-based collection. Particular to this collection is that each element type has specific semantics and, thus, a distinctive term distribution. This is analogous to, for example, entity-relationship graphs, where the semantics of the data is represented rather than the structural layout.

However, unlike its full-fledged semantic counterpart, the XML data "as it comes" does not explicate relationships between entities. This is reflected in the ORCM representation, as there are mainly term, classification and attribute relations. Furthermore, the "basic" representation of the XML data still uses XPath expressions to denote object Ids and contexts. This can be viewed as problematic since the main aim here is to achieve a semantic as opposed to structural representation and eventually semantic as opposed to structural retrieval models. There are two main solutions to this problem. The first is that the XML data in itself consists of element types that have specific semantics and therefore differ from logical or layout element types that are concerned with a document's or a page's presentation. Secondly, the structural representation can be lifted to become a semantic representation. The following Datalog rules exemplify how this can be done. The rule for "actor" underlines that a "semantic" object can be extracted by combining structural information about elements of type actor and their attributes (e.g. "russell crowe" in Figure 2).

    actorElement(XPath, Context) :-
        classification(actor, XPath, Context);

    actorEntity(ObjectId, Context) :-
        actorElement(XPath, Context) &
        attribute(id, XPath, ObjectId, Context);

The above rules derive a semantic Id for an actor; moreover, they lift the "structural" into a "semantic" classification. In a similar fashion, attributes can be lifted to become relations in "higher-level" layers of the ORCM. These layers can be derived from the "basic" ORCM and form an abstraction hierarchy from basic to structural to semantic schema. This helps to achieve data independence, as any data (XML, RDF, RSS) can be represented in the basic ORCM, and then application-specific relations are derived.

Overall, this discussion emphasises that the ORCM schema supports reasoning over structural elements, semantic structures and, eventually, semantic information which is "naturally" present in entity-relationship graphs.

5 Evaluation

The purpose of the evaluation is two-fold. Firstly, it proves the feasibility and applicability of the proposed knowledge representation, namely the probabilistic object-relational model, for both term-based and semantic retrieval. Secondly, it investigates the quality of the proposed retrieval model, which is one instance of the proposed knowledge representation.

                       MAP     RecipRank
TFK-IDF                35.07   36.80
TF-IEF-AF-IDF-top-1    52.11   53.96
TF-IEF-AF-IDF-top-5    60.32   62.04
Improvement            +71.19  +68.59

Figure 5: Retrieval Performance per Query Mapping (boldface indicates the best performing model; results are in percentages)

The experiment was performed on the IMDB collection³, which consists of 437,281 documents or XML records. Each document corresponds to a movie and was constructed from text data. The element types were "title", "year", "releasedata", "language", "genre", "country", "location", "colorinfo", "cast", "team" and "plot". Document content consists mostly of keywords, with the exception of the plot element.

We utilised the 40 queries and relevance criteria in [Kim et al., 2009]. For each query, a query term is mapped to its corresponding semantic structure. This leads to a set of queries that contain keywords and semantic predicates (attributes). The unit of retrieval for all queries is the movie object. Below is a logical representation of query number 28 with top-1 mapping.

    retrieve(X) :-
        X.title(gladiator) &
        X.genre(action) &
        X.actor(maximus) &
        X.plot(maximus) &
        X.director(scott);

For the experiments, we used HySpirit [Roelleke et al., 2001], a probabilistic reasoning system which supports the retrieval of text and (semi-)structured data. We chose HySpirit because it provides a framework with high-level and customisable concepts for modelling retrieval models. The framework provides an open-box approach for describing ranking models for any object.

Figure 5 shows the retrieval effectiveness for the test queries on the IMDB collection. To conserve space, only the performance of TF-IEF-AF-IDF with "top-1" and "top-5" mapping has been reported.
The main observation is that the proposed method, an attribute-based model for semantic retrieval, significantly outperforms (p-value < 0.01 with a two-tailed t-test) the TFK-IDF baseline. The baseline is a document-oriented retrieval model where the XML elements are discarded (similar to the method reported in [Theobald et al., 2005]).

The performance improvement can be attributed to the combination of term-based and attribute-based evidence spaces. Furthermore, the TF and AF parameters of the model are set to the BM25-like quantification that delivers the best performance since it mitigates the sub-optimal independence assumption of the total count.

The AF component, in particular, reflects that if a term occurs in a document then the probability that it occurs again is greater than the initial probability, i.e. the occurrence of an event depends on previous occurrences. Switching from term to attribute space, which groups terms under a particular context type, is conducive to retrieval performance.

Overall, the evaluation demonstrates that the expressiveness of the probabilistic ORCM model can lead to an effective model for semantic retrieval.

³ http://www.imdb.com/interfaces#plain

6 Summary & Conclusion

This paper demonstrates an approach for representing knowledge, how to merge object-relational and term-oriented modelling and how to steer term-based modelling towards semantic modelling (the semantic knowledge is explicit); therefore, it contributes a discussion of how object-relational modelling meets content modelling and its effect on probabilistic retrieval models for semantic retrieval.

Semantic retrieval requires models that, in sound and transparent ways, mix various frequencies and probabilities. Whereas in text retrieval probabilities of terms are the dominating players, in semantic retrieval probabilities of terms, classes, relationships, attributes and objects are the parameters involved in the design of a retrieval model. This paper introduces a particular retrieval model, namely TF-IEF-AF-IDF, in which the attribute frequency of retrieved elements (attributes) is a crucial component of ranking retrieved objects (contexts). The AF component gathers evidence from different query terms into one attribute. This aggregation has a positive outcome on retrieval quality, especially when combined with traditional term-based retrieval, as shown in this paper for TFK-IDF.

The proposed model is one instance of a large family of models that can be developed using the probabilistic object-relational content framework. The framework helps to transform retrieval models that are traditionally designed for terms, i.e. for keyword-based retrieval, into models that are based on propositions and, hence, tailored towards more complex and semantic retrieval tasks. Furthermore, the flexibility and openness of the framework encourages engineers to create a variety of retrieval models that combine textual, structural and semantic sources of evidence.

We have contributed to two related facets of semantic retrieval: knowledge representation and retrieval strategy modelling. Future work will investigate other retrieval models for semantic retrieval based on the probabilistic object-relational content model.

7 Acknowledgments

We would like to thank Jinyoung Kim of the University of Massachusetts Amherst for providing us with the collection and the queries. We would also like to thank the reviewers for their excellent suggestions.
References [Amer-Yahia and Lalmas, 2006] Sihem Amer-Yahia and Mounia Lalmas. Xml search: languages, inex and scoring. SIGMOD Rec., 35(4):16–23, 2006. [Bast et al., 2007] Holger Bast, Alexandru Chitea, Fabian M. Suchanek, and Ing- mar Weber. Ester: efficient search on text, entities, and relations. In SIGIR, 2007. [Bilotti et al., 2007] Matthew W. Bilotti, Paul Ogilvie, Jamie Callan, and Eric Ny- berg. Structured retrieval for question answering. In SIGIR, pages 351–358, 2007. [Chaudhuri et al., 2006] Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, and Gerhard Weikum. Probabilistic information retrieval approach for ranking of database query results. ACM Trans. Database Syst., 31(3):1134–1168, 2006. [Cornacchia and de Vries, 2007] R. Cornacchia and A. P. de Vries. A Parame- terised Search System. In Proceedings of the European Conference on IR Re- search (ECIR), 2007. Best student paper award. [Dalvi and Suciu, 2004] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864–875, 2004. [Elbassuoni et al., 2009] Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, and Gerhard Weikum. Language-model-based ranking for queries on rdf-graphs. In CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 977–986, New York, NY, USA, 2009. ACM. [Fuhr et al., 2002] Norbert Fuhr, Norbert Goevert, Gabriella Kazai, and Mounia Lalmas. Inex: Intitiative for the evaluation of xml retrieval. In ACM SIGIR Workshop on XML and Information Retrieval, 2002. [Fuhr, 1999] N. Fuhr. Towards data abstraction in networked information retrieval systems. Information Processing and Management, 35(2):101–119, 1999. [Hawking, 2004] David Hawking. Challenges in enterprise search. In ADC, pages 15–24, 2004. [Hiemstra and Mihajlovic, 2010] Djoerd Hiemstra and Vojkan Mihajlovic. A database approach to information retrieval: The remarkable relationship between language models and region models. Technical Report arXiv:1005.4752, May 2010. Comments: Published as CTIT Technical Report 05-35. [Kasneci et al., 2008] Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum. Naga: Searching and ranking knowl- edge. In ICDE, pages 953–962, 2008. [Kim et al., 2009] Jinyoung Kim, Xiaobing Xue, and W. Bruce Croft. A probabilis- tic retrieval model for semistructured data. In ECIR, pages 228–239, 2009. [Lu et al., 2005] Wei Lu, Stephen E. Robertson, and Andrew MacFarlane. Field- weighted xml retrieval based on bm25. In INEX, pages 161–171, 2005. [Meghini et al., 1993] C. Meghini, F. Sebastiani, U. Straccia, and C. Thanos. A model of information retrieval based on a terminological logic. In R. Korfhage, E. Rasmussen, and P. Willett, editors, Proceedings of the Sixteenth Annual Inter- national ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298–308, New York, 1993. ACM. [Ogilvie and Callan, 2002] Paul Ogilvie and Jamie Callan. Language models and structured document retrieval. In INEX Workshop, pages 33–40, 2002. [Ogilvie and Callan, 2003] P. Ogilvie and J. Callan. Language models and struc- tured document retrieval, 2003. [Ponte and Croft, 1998] J.M. Ponte and W.B. Croft. A language modeling approach to information retrieval. In W. Bruce Croft, Alistair Moffat, C. J. van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual Inter- national ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281, New York, 1998. ACM. 
[Prud’hommeaux and Seaborne, 2006] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. Technical report, W3C, 2006. [Robertson et al., 2004] Stephen E. Robertson, Hugo Zaragoza, and Michael J. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM, pages 42–49, 2004. [Robertson, 2004] S.E. Robertson. Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60:503–520, 2004. [Roelleke et al., 2001] Thomas Roelleke, Ralf Luebeck, and Gabriella Kazai. The HySpirit retrieval platform, demonstration. In Bruce Croft, David J. Harper, Don- ald H. Kraft, and Justin Zobel, editors, Proceedings of the 24th Annual Inter- national ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, New York, August 2001. ACM. [Roelleke et al., 2008] Thomas Roelleke, Hengzhi Wu, Jun Wang, and Hany Az- zam. Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes. VLDB J., 17(1):5–37, 2008. [Roelleke, 1999] Thomas Roelleke. POOL: Probabilistic Object-Oriented Logical Representation and Retrieval of Complex Objects. Shaker Verlag, Aachen, 1999. Dissertation. [Stonebraker et al., 1998] Michael Stonebraker, Dorothy Moore, and Paul Brown. Object-Relational DBMSs: Tracking the Next Great Wave. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998. [Theobald et al., 2005] Martin Theobald, Ralf Schenkel, and Gerhard Weikum. Topx and xxl at inex 2005. In INEX, pages 282–295, 2005. [Trotman and Sigurbjo¨rnsson, 2004] Andrew Trotman and Bo¨rkur Sigurbjo¨rnsson. Narrowed extended xpath i (nexi). In INEX, pages 16–40, 2004. [van Zwol and van Loosbroek, 2007] Roelof van Zwol and Tim van Loosbroek. Ef- fective use of semantic structure in xml retrieval. In ECIR, pages 621–628, 2007. [Wu and Roelleke, 2009] Hengzhi Wu and Thomas Roelleke. Semi-subsumed events: A probabilistic semantics for the BM25 term frequency quantification. In ICTIR (International Conference on Theory in Information Retrieval). Springer, 2009. [Zhao and Callan, 2008] Le Zhao and Jamie Callan. A generative retrieval model for structured documents. In CIKM ’08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 1163–1172, New York, NY, USA, 2008. ACM. 182 Binary Histograms for Resource Selection in Peer-to-Peer Media Retrieval Daniel Blank and Andreas Henrich University of Bamberg D-96052, Bamberg, Germany {daniel.blank | andreas.henrich}@uni-bamberg.de Abstract With the ever increasing amount of media data and collections on the world wide web and on private devices arises a strong need for adequate indexing and search techniques. Trends such as personal media archives, social networks, mo- bile devices with huge storage space and net- works with high bandwidth capacities make dis- tributed solutions and peer-to-peer (P2P) systems attractive. Here, resource selection can be ap- plied to determine a ranking of promising re- sources based on descriptions of their content. Resources are contacted in ranked order to re- trieve appropriate media items w.r.t. a user’s in- formation need. In this paper we apply and adapt resource de- scriptions in the form of binary histograms and corresponding selection techniques which were designed for low-dimensional spatial data to high-dimensional data in the context of content- based image retrieval (CBIR). W.r.t. 
related work in distributed information retrieval, which is also discussed in this paper, a main characteristic of our approach is more space-efficient resource descriptions. This makes them applicable to a wider range of application fields apart from the P2P domain.

1 Introduction

In recent years, there has been a tremendous increase in (personal) web data. Web users maintain blogs, twitter their lives and upload photos and videos to social media sites. Besides storing media items, people tend to share them with friends and interact with each other by collaboratively tagging or commenting on various items. Consequently, heterogeneous online resources which differ in size, media type and update characteristic have to be administered [Thomas and Hawking, 2009]. Hence, effective and efficient retrieval techniques are essential.

Several criteria can be employed for the retrieval of media items (cf. Fig. 1): a) textual content, b) geographic footprints, c) timestamps and d) (low-level) audio or visual content information. Based on these criteria, text, image, audio and video documents can be indexed and searched.

Peer-to-peer (P2P) scenarios for the administration of media collections are attractive for multiple reasons. Media items can reside on individual devices without a need to store them on remote servers hosted by service providers.

[Figure 1: Possible criteria for media retrieval – a media item with a) textual content, b) geographic footprint, c) timestamp and d) audio/visual content.]

Besides reducing dependence on service providers as informational gatekeepers, no expensive infrastructure has to be maintained when applying a scalable P2P protocol such as Rumorama [Müller et al., 2005b]. Crawling, which consumes large amounts of web traffic [Bockting and Hiemstra, 2009], can thus be avoided. Idle computing power in times of inactivity can be used to maintain, analyze and enrich media items.

Our work focuses on space-efficient resource description and corresponding selection techniques which allow for efficient and effective query processing. As a proof of concept, we design them for use within Rumorama without limiting their possible application. The resource description and selection techniques can also be applied in the context of traditional distributed information retrieval (IR) (cf. Sect. 2.1) or other variants of P2P IR systems (cf. Sect. 2.2). Furthermore, there is a range of possible application fields apart from P2P IR systems (cf. Sect. 2.3).

Rumorama is a scalable P2P protocol that builds hierarchies of PlanetP-like [Cuenca-Acuna et al., 2003] P2P networks. In Rumorama, every peer sees a portion of the network as a single, small PlanetP network and furthermore maintains connections to other peers that see other small PlanetP networks. To this end, the peer stores a small set of links pointing to neighboring peers in other subnets in order to be able to forward queries beyond the boundaries of its own PlanetP-like subnet. Each peer can choose the size of its PlanetP network according to local processing power and bandwidth capacity. Within its small PlanetP-like subnet, a peer knows resource descriptions of all other peers' data in the same subnet. These descriptions are disseminated by randomized rumor spreading and provide the basis for query routing decisions, i.e. which peers to contact in the local subnet during query processing.

Peers storing media items which are described by the criteria outlined in Fig.
1 can thus be summarized by corre- sponding resource or peer descriptions (cf. Fig. 2), where each peer is considered as a resource of potentially use- 183                                      Figure 2: Peer or resource descriptions based on media item features. ful media items with respect to the given query. These peer or resource descriptions can be envisaged as an ag- gregation of the features of the media items stored on the respective peer. Resource description and selection for tex- tual data has already been extensively discussed in liter- ature (cf. [Cuenca-Acuna et al., 2003]). Techniques for time and date information are presumably less challenging and might consist of a combination of clustering (cf. [Duda et al., 2000]) and histogram techniques (cf. [Ioannidis, 2003]). We proposed resource description and selection tech- niques for geographic data in [Blank and Henrich, 2009; 2010]. Techniques for CBIR were e.g. addressed in [Blank et al., 2007]. In this paper we will apply and adapt ultra fine-grained summaries (UFS, cf. Sect. 3), the most promis- ing technique in the context of geographic data based on binary histograms, for CBIR and analyze its use in more detail. The contribution of this paper is i) the application and adaptation of UFS for CBIR, ii) a detailed analysis of summary sizes in the case of CBIR, and iii) an analysis of time complexity for peer ranking. Of course, resource ranking based on a single criterion is only a first step. When querying for multiple criteria, e.g. for a sunset image in a certain geographic region, criterion- specific resource rankings can be combined applying a merging algorithm for ranked lists (cf. [Belkin et al., 1995; Ilyas et al., 2008]). The remainder of this paper is organized as follows. Sect. 2 discusses related work. In Sect. 3 we present an analysis of the application and adaptation of UFS for CBIR. Sect. 4 concludes with an outlook on future work. 2 Related Work 2.1 Traditional Distributed IR Systems Traditional distributed IR is mainly concerned with text data. Resources are usually described by the set of terms (or a subset of them) which are contained in the documents of a resource and some kind of frequency information per term plus possibly additional statistics [Callan, 2000]. There is plenty of work on resource selection in this context (references are e.g. given in [Thomas and Hawking, 2009; Bockting and Hiemstra, 2009]). We will now describe approaches which address re- source description and selection based on low-level visual content features. Chang et al. [Chang and Zhang, 1997; Chang et al., 1997] propose three approaches. For all of them, a relatively small number of feature vectors from reference images is used (so called templates or icons). In contrast to our approach (cf. Sect. 3), templates are selected from the underlying data collection, i.e. the images that are administered by the resources. For all three approaches, a set of matching templates is computed per image and per query based on a predefined similarity threshold. The first approach uses the number of images which are assigned to a certain template as a ranking criterion. The second approach additionally applies mean and variance informa- tion w.r.t. the similarity values between a template and the images which are assigned to it. In experimental studies, both approaches perform worse than a histogram-based ap- proach. Here, a special form of clustering is applied in or- der to further partition the feature space covered by a tem- plate. 
This approach differs from our approach, e.g., w.r.t. the type of clustering, the mechanism for computing the histogram (more parameters needed, no compression, no external collection), and especially the number of applied reference points.

[Berretti et al., 2004] apply a special form of hierarchical clustering to a resource's set of feature vectors in order to ascertain a resource description. A predefined maximum cluster radius is used for determining the centroids which are included in the resource description. Every path in the clustering tree is descended as long as the cluster radius of a node is bigger than the maximum cluster radius. Amongst other information, the centroids of the nodes where the search stops are included in the resource description. By varying the maximum cluster radius, the granularity and size of the resource descriptions can be adapted. When it comes to resource selection, centroids and cluster radii are applied. Compared to our approach, the size of the resource descriptions is expected to be bigger with smaller potential for compression, since centroids are usually represented by d-dimensional real-valued feature vectors (d often between 10² and 10³). We proposed similar approaches in [Allali et al., 2008], where local clustering and Gaussian mixture models (GMMs) are compared. For local clustering, d-dimensional cluster centroids are included in the summaries. Mean and variance vectors of dimensionality d plus size information capturing the number of images of a peer which lie in a certain cluster are used as summaries in the case of GMMs. GMMs perform better than local clustering in terms of ranking selectivity, but cannot outperform the approaches presented in Sect. 3. In addition, average summary sizes are expected to be bigger, since the summarization of a peer with only a single real-valued image feature vector and thus one centroid will consume d · 4 bytes for local clustering and (2d + 1) · 4 bytes for GMMs, if we assume the usage of 4 bytes per information unit.

[Kim et al., 2002] apply a multi-dimensional selectivity estimation approach based on compressed histograms (cf. [Lee et al., 1999]) for resource description and selection. The d-dimensional feature space is partitioned based on a uniform, multi-dimensional grid. The histogram captures the number of features which lie in a certain bucket of the grid. Since the number of buckets rapidly increases with increasing d, a multi-dimensional discrete cosine transform (DCT) is applied in order to reduce histogram sizes. With the help of an adequate sampling strategy, only the most important DCT coefficients are selected for representation. 8 bytes are used per DCT coefficient (4 bytes for the histogram index and 4 bytes for the histogram value). In their experiments in [Kim et al., 2002], 2,000 to 2,500 coefficients are used. The entity responsible for resource selection can apply the inverse DCT in order to recover the histogram with low error rates. For multi-dimensional range queries, the hyper-sphere representing the search region is approximated by multiple hyper-squares. Histogram information is used to determine the selectivity of individual resources w.r.t. the query. The method is further extended in order to support resource selection in heterogeneous settings, where each resource may use its own local similarity measure which may be different from a global similarity measure used by the entity performing resource selection.
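To illustrate the kind of histogram-based resource description just outlined for [Kim et al., 2002] (without the DCT compression and the range-query approximation), the following sketch builds a uniform grid histogram per resource and ranks resources by a crude selectivity estimate for a query vector. The function names, the fixed [0, 1] feature range and the single-cell selectivity estimate are simplifying assumptions; this is neither the cited method in full nor the UFS summaries proposed later in this paper.

    from collections import Counter

    def grid_cell(vector, cells_per_dim, lo=0.0, hi=1.0):
        """Map a feature vector to its cell in a uniform multi-dimensional grid."""
        width = (hi - lo) / cells_per_dim
        return tuple(min(int((x - lo) / width), cells_per_dim - 1) for x in vector)

    def grid_histogram(vectors, cells_per_dim):
        """Count how many of a resource's feature vectors fall into each cell."""
        hist = Counter()
        for v in vectors:
            hist[grid_cell(v, cells_per_dim)] += 1
        return hist

    def rank_resources(query_vector, histograms, cells_per_dim):
        """Rank resources (peer id -> histogram) by the number of vectors in
        the query's cell, a rough stand-in for the selectivity estimation."""
        qcell = grid_cell(query_vector, cells_per_dim)
        return sorted(histograms,
                      key=lambda peer: histograms[peer].get(qcell, 0),
                      reverse=True)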
In [Kim and Chung, 2003], different resource selec- tion strategies are evaluated which do not rely on histogram information and instead use queried feature vectors in or- der to determine relevant resources by applying different regression models on distributions of global and local sim- ilarity value pairs of queried feature vectors. 2.2 P2P IR Systems P2P IR systems are often classified as being structured or unstructured overlay networks. As a secondary classifica- tion criterion, we introduce the distinction between data- independent and data-dependent overlays in order to reflect if a peer’s content or e.g. query profiles have an effect on overlay generation. This distinction is helpful to pinpoint different characteristics in a more organized way. In the following, we will briefly discuss various approaches. Unstructured P2P IR Systems Data-independent: Main protocols in this group are Plan- etP [Cuenca-Acuna et al., 2003] and its extension Ru- morama [Müller et al., 2005b]. In Rumorama, a peer sees the network as a single, small PlanetP network (called sub- net) with connections to other peers that see other PlanetP subnets. Each peer can choose the size of its subnet ac- cording to local processing power and bandwidth capacity. Within a subnet, a peer knows data summaries of all other peers in the same subnet. Gossiping techniques are used to disseminate the summaries. In a subnet, summary-based resource selection allows for semantic query routing. Ad- ditionally, a peer maintains a small set of links pointing to neighboring peers in other subnets in order to be able to forward queries outside the boundaries of its own subnet. In its original form, peers are assigned to subnets arbitrar- ily, i.e. independent of the peers’ content. But, Rumorama can be easily extended by a grouping of peers similar to the content-dependent overlays described in the following. Ad- ditionally, summaries might be visualized and thus be ben- eficial for interactive retrieval, e.g. by providing—with low bandwidth requirements—a visual overview of peer data for a large number of peers. Routing indexes in various forms (for references cf. [Doulkeridis et al., 2009]) represent aggregated infor- mation in an unstructured network maintained at a peer for all its neighboring peers in order to decide in which di- rection queries should be forwarded. Initially designed for one-dimensional values in order to avoid network flooding, they have e.g. been extended to allow for multi-dimensional queries. Data-dependent: Many semantic overlay networks (SONs) (for references and a detailed description cf. [Doulkeridis et al., 2010]) can be characterized as data- dependent, unstructured P2P networks. Here, the content of a peer’s data or information about past queries defines a peer’s place in the network. Thus, summaries of a peer’s content or query profiles are needed. Two types of links are usually maintained: short links grouping peers with simi- lar content or query profiles into so called “clusters of in- terest” (COIs) and long links that are established between different COIs. During query execution the query has to be forwarded to the most promising COI(s). In order to form COIs, clustering, classification as well as gossiping techniques can be applied. Structured P2P IR Systems Data-independent: Structured P2P IR systems are based on distributed indexing structures with distributed hashta- bles (DHTs) being the most prominent class member. 
Every peer in the network is usually responsible for a certain range of the feature space. Thus, when enter- ing the network or updating local content, indexing data has to be transferred to remote peers according to the peers’ responsibilities. In case of data-independent, struc- tured P2P IR systems, terms (cf. [Bender et al., 2005]) or high-dimensional feature vectors for CBIR (cf. [No- vak et al., 2008; Lupu et al., 2007; Vu et al., 2009]) are usually mapped to one-dimensional or multi-dimensional keys which can be indexed in a classical DHT such as Chord [Stoica et al., 2001] or CAN [Ratnasamy et al., 2001] respectively. Data-dependent: SONs—as described above—can also be implemented on top of a DHT in order to enhance query routing [Doulkeridis et al., 2010]. Clustering, classifica- tion as well as gossiping techniques are applied in order to establish links to peers with similar content. Indexing of Multiple Criteria In structured, data-independent systems, correlations be- tween different criteria (e.g. geographic and image content information) are difficult to exploit when indexing multi- ple feature types. If we e.g. assume an image from the Sahara Desert with shades of beige sand and blue sky, different peers might be responsible for indexing the ge- ographic and the image content information. Therefore, when distributing the indexing data of the Sahara image, querying for it, or removing it from the network, (at least) two different peers have to be contacted. Within SONs, the simultaneous indexing of multiple criteria would re- quire the definition of a similarity between peers and im- ages combining e.g. geographic and image content infor- mation. Alternatively, multiple overlays might be main- tained. Within unstructured, data-independent P2P IR sys- tems, it is possible to apply one summary and a correspond- ing resource selection technique per feature type. Feature- specific peer rankings can be combined by applying an al- gorithm for the merging of ranked lists [Belkin et al., 1995; Ilyas et al., 2008]. Alternatively, the creation of summaries and resource selection algorithms integrating multiple fea- ture types is possible (cf. [Hariharan et al., 2008]). Hybrid Approaches and Super-Peer Architectures A main characteristic of unstructured P2P IR systems is that a peer only administers indexing data of media items which belong to its user. Thus, when entering the system or updating media items, full indexing data does not have to be transferred to remote peers. Peer autonomy is bet- ter respected compared to structured networks [Doulkeridis et al., 2010]. On the other hand, structured systems offer query processing with logarithmic cost. In order to reduce the load imposed on the network when inserting new media items, super-peer architectures [Papapetrou et al., 2007] as well as DHT-based indexing of compact data summaries instead of full indexing data has been proposed (cf. [Lupu et al., 2007]). 185 In general, there is a convergence of structured and un- structured P2P IR systems with many hybrid approaches. We have e.g. evaluated an approach where indexing data is stepwisely transferred amongst peers in order to make peers more focused and—as a consequence—summaries more selective. More selective summaries with peers hav- ing specialized on a certain range of the feature space lead to more efficient resource selection [Eisenhardt et al., 2008]. There is plenty of work addressing super-peer architec- tures (for references cf. [Doulkeridis et al., 2009]). 
They are designed in order to overcome some limitations of “true" P2P IR systems and make use of increased capabili- ties such as storage capacity, processing power or available network bandwidth. Often, concepts known from “true" P2P IR systems are extended and transferred to super-peer networks. Also within super-peers the convergence of dif- ferent approaches can be seen. [Doulkeridis et al., 2009] e.g. apply multi-dimensional routing indexes on a super- peer level and additionally group similar super-peers close together in order to allow for better query routing. In this context, our resource selection techniques are not restricted to data-independent, unstructured P2P IR systems. The summaries can also be used within data- dependent, unstructured P2P IR systems to form COIs and within structured networks e.g. to be indexed in a DHT. In addition, summaries could be used by super-peers for se- lecting either “normal” peers or other super-peers. Further application fields are also possible as will be described in the following section. 2.3 Possible Application Fields apart from the P2P Context In addition to P2P IR (cf. Sect. 2.2) our resource summa- rization and selection techniques can also be used in tradi- tional distributed IR applications (cf. Sect. 2.1). Personal meta-search is a novel application of distributed IR, where all the online resources of a person are queried (e-mail accounts, web pages, image collections, etc.). These re- sources are typically heterogeneous in size, media type and update frequency possibly requiring selective and space ef- ficient summaries in this context [Thomas and Hawking, 2009]. Our summarization techniques might also be applied within (visual) sensor [Elahi et al., 2009] as well as ad hoc networks [Lupu et al., 2007]. Within sensor networks, lim- ited processing power, bandwidth and energy capacities ne- cessitate aggregation techniques which are based on local information with a clear focus on space efficiency. [Lupu et al., 2007] present an approach for ad hoc information shar- ing based on mobile devices when people meet at certain events or places. Here, it might not be feasible to transfer complete indexing data but only summarized information. Distributed IR techniques can also be used for vertical selection within aggregated search [Arguello et al., 2009]. Vertical selection is the task of identifying relevant ver- ticals, i.e. focused search services such as image, news, video or shopping search. A user issuing a textual query “music beatles” might also be interested in music videos and thus the results of video search or small previews should be integrated in result presentation of classical web search. In this context, a vertical can be interpreted as a re- source and the task of selecting relevant verticals is similar to resource selection in distributed IR requiring adequate features, i.e. resource descriptions, and corresponding se- lection mechanisms. Space efficient resource descriptions might also be ben- eficial in the context of recommender systems and social search e.g. in order to compute the similarity between dif- ferent users of social network sites. Similar users can be determined not only based on having the same friends, using the same tags, bookmarking the same media items, etc. [Guy et al., 2010], but also depending on the similarity of media content. Another potential application area is automatic theme identification. Automatic theme identification of photo sets e.g. 
in case of digital print products (cf. http://comminfo.rutgers.edu/conferences/mmchallenge/2010/03/11/cewe-challenge/, last visit: 8.7.2010) is concerned with the task of finding suitable background themes for a given set of images. Themes can be travel, wedding, etc. Each theme can be described by a set of photos and modeled as a summary. Afterwards, the theme descriptions and the description of a user's collection can be compared in order to recommend the best matching theme(s).

Resource selection techniques have been successfully applied to blog site search [Elsas et al., 2008]. A blog feed is viewed as a single collection and individual posts are interpreted as documents in order to retrieve similar blogs according to a given information need. A similar approach can be taken in passage retrieval such as XML retrieval, where different sections, subsections, etc. might be grouped together as a resource (for references cf. [Lalmas, 2009]).

Also expert search [Balog et al., 2009] could presumably be built on resource description and selection techniques. Here, a user is interested in finding human experts, in an enterprise for example. Thus, e.g. all documents a person has (co)authored could be modeled as a resource, and finding an expert would result in selecting the most promising resource.

Compact resource descriptions might also be valuable for focused crawling [Ahlers and Boll, 2009]. If a service provides summaries of the image content of a certain website or media archive, a crawler could estimate the potential usefulness of this resource for its focused crawling task before actually visiting the source. This way, crawl efficiency can be improved by preventing the crawler from analyzing too many irrelevant pages. Web traffic imposed by downloading large sets of images in order to extract CBIR features can thus be avoided.

Tree-based index structures are also related to our work (cf. [Samet, 2006]). The decision of choosing the best subtree is similar to the resource selection problem. Summaries in the P2P context correspond to aggregations maintained in the nodes of a tree, e.g. bounding boxes in the case of an R-tree [Guttman, 1984].

3 Resource Summarization and Selection for low-level Visual Content Features

[Müller et al., 2005a] proposed the use of "cluster histograms" for distributed CBIR. In order to compute cluster histograms, a moderate number of reference points is used (e.g. k = 256). This set of reference points is known to all peers. Every image feature vector of a peer's local image collection is assigned to the closest reference point. Thereby, a cluster histogram is computed counting how many image feature vectors of a peer's collection are closest to a certain reference point, i.e. cluster centroid cj (1 ≤ j ≤ k). Reference points are determined by distributed k-means clustering, which imposes some load on the network. During resource selection, only histogram information regarding the cluster whose reference point lies closest to the query feature vector is used. Peers with more feature vectors assigned to this cluster are ranked higher than peers with fewer feature vectors assigned to the cluster.
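A minimal sketch of this summary construction; the function name and the NumPy-based formulation are illustrative assumptions, not the implementation of [Müller et al., 2005a]:

```python
import numpy as np

def cluster_histogram(features, reference_points):
    """Cluster-histogram summary: for each of the k shared reference points,
    count how many of the peer's image feature vectors lie closest to it."""
    # pairwise Euclidean distances between local feature vectors and reference points
    dists = np.linalg.norm(features[:, None, :] - reference_points[None, :, :], axis=2)
    closest = dists.argmin(axis=1)  # index of the nearest reference point per feature vector
    return np.bincount(closest, minlength=len(reference_points))
```

For a query, the basic scheme then simply compares peers by the count in the cluster whose reference point is closest to the query feature vector; the refinement described next extends this comparison to further clusters in case of ties.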
In [Eisenhardt et al., 2006] the performance of resource selection is further improved. A list L of reference points cj is sorted in ascending order according to the distance of cj to the query feature vector q. In order to rank peer pa before pb or vice versa, L is processed from the beginning, possibly to the end. The first element of L corresponds to the cluster centroid closest to q. A peer with more documents in this cluster is ranked higher than a peer with fewer documents in this very cluster. If two peers pa and pb administer the same number of images in the analyzed cluster and the end of the list has not yet been reached, the next element of L is chosen and, based on the number of documents within the current cluster, it is again tried to rank pa before pb or vice versa. As a second modification, distributed clustering is replaced in [Eisenhardt et al., 2006] by a random selection of reference points. Overall ranking selectivity is slightly affected, but there is no longer any network load imposed by distributed clustering.

We have extended the work from [Müller et al., 2005a; Eisenhardt et al., 2006] in several directions (cf. [Blank et al., 2007]). First, within highly fine-grained summaries (HFSk, with k indicating the number of reference points used) we increased the number of reference points for computing the cluster histogram, e.g. from k = 256 to k = 8,192 or even more. By doing so, the feature space is partitioned in a more fine-grained way, offering improved ranking selectivity. Since a higher number of reference points would lead to less space efficient summaries, we apply compression techniques. Thus, we can achieve better ranking selectivity with more space efficient resource descriptions compared to the approaches in [Müller et al., 2005a; Eisenhardt et al., 2006]. The average size of a peer's summary information is approx. 110 bytes for k = 16,384, which is clearly less than for other approaches in distributed CBIR (cf. Sect. 2.1) if they were applied to our scenario directly. Second, within our approach reference points are selected from an external source and transferred to peers together with updates of the P2P software. This leads to a decrease in overall network load and makes distributed selection mechanisms obsolete. Ranking selectivity is slightly affected by this change, as will be shown in Sect. 3.3.

3.1 Experimental Setup

In the experiments we use a 166-dimensional uniformly quantized color histogram (cf. http://www.gnu.org/software/gift/, last visit: 5.7.2010) based on the HSV color space with 18 hues, 3 saturations and 3 values, plus 4 levels of gray. Image feature vectors are compared using Euclidean distance. We perform 20 runs in which we change the reference points used. During a run we perform 100 queries, where we randomly select a query image from the underlying collection. The set of queries stays constant over all runs. We analyze the number of peers which have to be contacted on average in order to retrieve the 20 closest feature vectors w.r.t. a given query feature vector, assuming that the most similar images are the ones the user is interested in.

We crawled a collection of 233,827 Flickr images. They are assigned to peers based on the Flickr user ID in order to reflect a realistic scenario. Hence, we assume that every Flickr user operates a peer of its own. The images are mapped to 10,601 peers/users which are used in our simulation. Fig. 3 shows the distribution of peer sizes, i.e. the number of images which are maintained per peer.

[Figure 3: Distribution of peer sizes; number of images per peer over the peers ordered by decreasing number of images (Flickr collection).]
The general characteristic is typical for P2P file-sharing applications, with few peers managing large amounts of the images and many peers administering only few images [Saroiu et al., 2002].

3.2 Using UFS for CBIR

When summarizing geographic footprints, binary histograms (so-called UFSk: ultra fine-grained summaries) outperformed HFSk (cf. [Blank and Henrich, 2010]). In contrast to HFS, UFS are based on a bit vector with the bit at position j indicating if centroid j is the closest centroid to one or more of a peer's image feature vectors. Hence, we obtain a bit vector of size k. Of course, there is some loss of information when switching from HFS to UFS with k staying constant. However, UFS have the potential of resulting in more space efficient resource descriptions. Potentially, this allows for more centroids being used, which might result in similar or even improved ranking selectivity compared to HFS. In the following we will thus evaluate the use of UFS in the context of high-dimensional feature vectors for CBIR.

3.3 Analysis of Ranking Selectivity

Reference points for summary creation and peer ranking are chosen from the underlying collection (UFS/HFS) or a second collection of 45,931 Flickr images (UFSe/HFSe, with "e" indicating the use of an external collection for the reference points). It is important to note that both collections are disjoint w.r.t. the unique Flickr image and user IDs, but there is some minor natural overlap amongst the collections w.r.t. image content; 24 of the 233,827 images also appear in the external collection, because some images are uploaded by multiple users independently on Flickr.

Fig. 4 shows the number of peers which are contacted on average in order to retrieve the 20 closest feature vectors w.r.t. a given query feature vector.

[Figure 4: Fraction of peers contacted to retrieve top-20 image feature vectors (ranking selectivity); fraction of peers contacted [in %] over the number of reference points for HFS, UFS, HFSe and UFSe.]

Ranking selectivity increases degressively with increasing k. There is a gap in ranking selectivity when choosing the reference points from an external collection (HFSe compared to HFS and UFSe compared to UFS). The gap increases with increasing k. For UFS/HFS, with increasing k also the probability increases of choosing a centroid which is also used as a query feature vector. Such situations might lead to improved ranking selectivity, since queries are randomly chosen from the underlying data collection. The evaluation of other sources of feature vectors as queries will be part of future work. Fig. 4 additionally shows slightly improved ranking selectivity for HFS(e)256 compared to UFS(e)256 (HFS(e)k is used as an abbreviation for "HFSk and HFSek"; the same notation is also adopted for UFS throughout the paper, and HFS/UFS abbreviates "HFS and UFS"), which is due to the use of non-binary histogram information during peer ranking within HFS(e). In general, this gap diminishes more and more when increasing the number of centroids used, since HFS(e) histograms more and more pass into binary histograms. Already for HFS1,024 compared to UFS1,024 and HFSe1,024 compared to UFSe1,024 there is no noticeable difference in ranking selectivity at all.

3.4 Analysis of Summary Sizes

The size of the resource descriptions after zipping is analyzed in Fig. 5 and Fig. 6.
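As a minimal sketch of how such a binary summary and its zipped size can be obtained; the function names, the NumPy-based formulation and the use of zlib as a stand-in for the zipping applied in the paper are illustrative assumptions:

```python
import zlib
import numpy as np

def binary_summary(features, reference_points):
    """UFS-style summary: bit j is set iff reference point j is the closest
    reference point for at least one of the peer's image feature vectors."""
    dists = np.linalg.norm(features[:, None, :] - reference_points[None, :, :], axis=2)
    bits = np.zeros(len(reference_points), dtype=bool)
    bits[np.unique(dists.argmin(axis=1))] = True
    return bits

def zipped_size(bits):
    """Size of the bit-packed and compressed summary in bytes."""
    return len(zlib.compress(np.packbits(bits).tobytes()))
```

Peers with few images set only few bits, so their compressed summaries stay small; this is reflected in the size distributions discussed below.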
Fig. 5 shows average summary sizes s_avg when using UFSe instead of HFSe. The plot for UFS and HFS shows similar characteristics. In addition, Fig. 6 visualizes the different quartiles and the minimum/maximum values of the summary sizes in a box plot.

[Figure 5: Avg. summary sizes (zipped); average summary size [in bytes] over the number of reference points for HFSe and UFSe.]

[Figure 6: Box plot of summary sizes; summary size [in bytes] over the number of reference points for HFSe and UFSe.]

It shows that the median in the case of HFSe is bigger compared to UFSe. Interquartile ranges of HFSe and UFSe become more and more similar when increasing k, although the overall range of HFSe summary sizes is greater than the range of summary sizes in the case of UFSe. All distributions of summary sizes are positively skewed, indicating many peers with small summary sizes and few peers with big summary sizes. Thus, the distribution of peer sizes (cf. Fig. 3) is reflected in the distribution of summary sizes (cf. Fig. 6).

One might think of a hybrid peer ranking scheme, e.g. using HFSe for the smaller peers (i.e. peers with few documents) and UFSe for the bigger peers (i.e. peers with many documents), in order to reduce the network load imposed by rumor spreading. The cost of one round of rumor spreading can be estimated by s_avg · n · (n − 1), with n being the number of peers in a PlanetP-like network. Hence, the cost is proportional to s_avg. Since an increase in ranking selectivity can be perceived only for small values of k when switching from HFS(e) to UFS(e) (cf. Fig. 4), the UFS(e) alternative can be safely chosen for all peers in the network in case of big values of k. If there are bigger differences in ranking selectivity amongst competing resource description and peer ranking schemes, the cost for query processing has to be taken into account additionally. A more detailed analysis can be found in [Blank and Henrich, 2010].

3.5 Analysis of Time Complexity

In general, it is important that peer ranking can be done within a reasonable amount of time. As described above, ranking peers mainly means sorting k-dimensional numbers, where the importance of the single dimensions is defined by the list L which contains the reference points sorted according to their distance to the query feature vector. In a first run the peers are sorted w.r.t. the dimension representing the closest reference point. Of course, this sorting can be done in O(n · log n), where n stands for the number of peers in the considered PlanetP network. In a worst-case scenario all peers would be identical in the number of media items maintained in each of the k clusters, ending up in a complexity of O(k · n · log n). Thus, the worst-case complexity for calculating a peer ranking depends on k, which of course is disadvantageous for HFS(e)/UFS(e) with high values of k.

[Figure 7: Fraction of peers contacted for retrieving top-20 images with UFS(e)(256) vs. UFS(e) (ranking selectivity); fraction of peers contacted [in %] over the number of reference points for UFSe, UFSe(256), UFS and UFS(256).]

In order to test whether this worst case has practical implications, we compared the original approach, considering the clusters to the end if necessary, with a modified variant using at most the 256 clusters closest to the query. If no decision is possible after considering the histogram values for these 256 clusters, a random choice is made.
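A sketch of this bounded ranking; the lexicographic formulation, the parameter names and the random sort-key tie-break are illustrative assumptions rather than the authors' implementation (it works on cluster-histogram counts as well as on UFS bit vectors):

```python
import random
import numpy as np

def rank_peers_truncated(summaries, reference_points, query, m=256):
    """Rank peers using at most the m reference points closest to the query;
    peers still tied after these m clusters are ordered randomly."""
    order = np.linalg.norm(reference_points - query, axis=1).argsort()[:m]  # truncated list L
    def sort_key(peer):
        # more feature vectors (or set bits) in clusters close to the query rank a peer
        # higher; the trailing random component only separates otherwise identical peers
        return tuple(summaries[peer][order]) + (random.random(),)
    return sorted(summaries, key=sort_key, reverse=True)
```

Stopping after m of the k clusters avoids the worst case of O(k · n · log n) discussed above.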
In the following, UFS(e)k(256) will denote the modified approach considering k centroids for summary creation while applying at most the 256 centroids closest to the query feature vector for peer ranking. Results can be seen in Fig. 7. There is no noticeable difference in ranking selectivity for UFS(e)k compared to UFS(e)k(256) for summaries with up to k = 8,192 centroids. UFS(e)16,384 performs slightly better than UFS(e)16,384(256). When increasing k, the feature space is partitioned in a more fine-grained way. If only 256 centroids are used for peer ranking, the fraction of unused centroids which potentially contain relevant information increases; e.g. in the case of UFS(e)16,384(256), 1 − 256/16,384 ≈ 98.4% of the summary information is discarded during peer ranking.

The results in Fig. 7 demonstrate two things. First, obviously very few of the k histogram bins are usually considered for peer ranking; otherwise the differences between UFS(e) and UFS(e)(256) would have been higher. Second, programmers anxious about worst-case bounds can stop processing after considering a certain number of histogram bins and thus avoid the worst case of O(k · n · log n).

4 Conclusion & Outlook

In this paper we have applied and adapted binary histograms, which were originally designed for the summarization of low-dimensional spatial data, to resource description and selection based on high-dimensional CBIR features. Compared to earlier work, summaries can be zipped more efficiently. A huge number of reference points (e.g. 16,384) is applied in order to generate resource descriptions. We have shown that it is possible to use only a small fraction of the reference points (e.g. 256) during peer ranking in order to speed up query processing with only a marginal decrease in ranking selectivity.

In future work we will try to find an adequate stopping criterion which indicates when it is no longer beneficial to contact further peers. This might be interwoven with a technique that adaptively determines the number of centroids used for peer ranking. Additionally, we will apply our resource descriptions in order to summarize local image features such as SIFT.

References

[Ahlers and Boll, 2009] Dirk Ahlers and Susanne Boll. Adaptive geospatially focused crawling. In Proc. of the 18th ACM Conf. on Information and Knowledge Management, pages 445–454, Hong Kong, China, 2009.

[Allali et al., 2008] Soufyane Allali, Daniel Blank, Wolfgang Müller, and Andreas Henrich. Image data source selection using Gaussian mixture models. In Adaptive Multimedial Retrieval: Retrieval, User, and Semantics: 5th Intl. Workshop, AMR 2007, Paris, France, 2007, Revised Selected Papers, pages 170–181, Berlin, Heidelberg, 2008. Springer LNCS 4918.

[Arguello et al., 2009] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proc. of the 32nd Intl. ACM SIGIR Conf. on research and development in Information Retrieval, pages 315–322, Boston, MA, USA, 2009.

[Balog et al., 2009] Krisztian Balog, Ian Soboroff, Paul Thomas, Nick Craswell, Arjen de Vries, and Peter Bailey. Overview of the TREC 2008 enterprise track. In The Seventeenth Text Retrieval Conf. Proceedings (TREC 2008). NIST, 2009. Special Publication.

[Belkin et al., 1995] N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Inf. Processing and Management, 31(3):431–448, 1995.
[Bender et al., 2005] Matthias Bender, Sebastian Michel, Ger- hard Weikum, and Christian Zimmer. The minerva project: Database selection in the context of P2P search. In Daten- banksysteme in Business, Technologie und Web, 11. Fach- tagung des GI-Fachbereichs "Datenbanken und Information- ssysteme" (DBIS), pages 125–144, Karlsruhe, Germany, 2005. [Berretti et al., 2004] Stefano Berretti, Alberto Del Bimbo, and Pietro Pala. Merging results for distributed content based im- age retrieval. Multimedia Tools Appl., 24(3):215–232, 2004. [Blank and Henrich, 2009] Daniel Blank and A. Henrich. Summarizing georeferenced photo collections for im- age retrieval in P2P networks. In Proc. of Workshop on Geographic Information on the Internet, pages 55–60, http://georama-project.labs.exalead.com/ workshop/GIIW-proceedings.pdf (12.11.2009), Toulouse, France, 2009. [Blank and Henrich, 2010] Daniel Blank and Andreas Henrich. Description and selection of media archives for geographic nearest neighbor queries in P2P networks. In Proc. of IAPMA2010: Information Access for Personal Media Archives Workshop, pages 22–29, http://doras.dcu. ie/15373/ (25.5.2010), Milton Keynes, UK, 2010. [Blank et al., 2007] Daniel Blank, Soufyane El Allali, Wolfgang Müller, and Andreas Henrich. Sample-based creation of peer summaries for efficient similarity search in scalable peer-to- peer networks. ACM SIGMM Workshop on Multimedia In- formation Retrieval (MIR 2007), Augsburg, Germany, pages 143–152, 2007. [Bockting and Hiemstra, 2009] Sander Bockting and Djoerd Hiemstra. Collection Selection with Highly Discriminative Keys. In Proc. of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, http://lsdsir09. isti.cnr.it/lsdsir09-1.pdf (26.04.2010), Boston, MA, USA, 2009. [Callan, 2000] Jamie Callan. Distributed information retrieval. In W. B. Croft, editor, Advances in Information Retrieval, pages 127–150. Kluwer Academic Publishers, 2000. 189 [Chang and Zhang, 1997] Wendy Chang and Aidong Zhang. Metadata for distributed visual database access. In 2nd IEEE Metadata Conf., pages 1–11, Silver Spring, MD, USA, 1997. [Chang et al., 1997] Wendy Chang, Gholamhosein Sheik- holeslami, Aidong Zhang, and Tanveer F. Syeda-Mahmood. Efficient resource selection in distributed visual information systems. In Proc. of the 5th ACM Intl. Conf. on Multimedia, pages 203–213, Seattle, Washington, USA, 1997. [Cuenca-Acuna et al., 2003] Francisco Cuenca-Acuna, Christo- pher Peery, Richard P. Martin, and Thu D. Nguyen. PlanetP: Using gossiping to build content addressable peer-to-peer in- formation sharing communities. In IEEE Intl. Symp. on High Performance Distributed Computing, pages 236–246, Seattle, WA, USA, 2003. [Doulkeridis et al., 2009] Christos Doulkeridis, Akrivi Vlachou, Kjetil Nørvåg, Yannis Kotidis, and Michalis Vazirgiannis. Multidimensional routing indices for efficient distributed query processing. In Proc. of the 18th ACM Conf. on Infor- mation and Knowledge Management, pages 1489–1492, Hong Kong, China, 2009. [Doulkeridis et al., 2010] Christos Doulkeridis, Akrivi Vlachou, Kjetil Nørvåg, and Michalis Vazirgiannis. Handbook of Peer- to-Peer Networking. Part 4: Distributed Semantic Overlay Net- works. Springer Science+Business Media, 1st edition, 2010. [Duda et al., 2000] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, Nov. 2000. [Eisenhardt et al., 2006] Martin Eisenhardt, Wolfgang Müller, Andreas Henrich, Daniel Blank, and Soufyane El Allali. 
Clustering-based source selection for efficient image retrieval in peer-to-peer networks. In 8th IEEE Intl. Symp. on Multime- dia, pages 823–830, San Diego, CA, USA, 2006. [Eisenhardt et al., 2008] Martin Eisenhardt, Wolfgang Müller, Daniel Blank, Soufyane El Allali, and Andreas Henrich. Clustering-based, load balanced source selection for CBIR in P2P networks. Intl. Journal of Semantic Computing (IJSC), 2(2):235–252, 2008. [Elahi et al., 2009] B. Maryam. Elahi, Kay Römer, Benedikt Os- termaier, Michael Fahrmair, and Wolfgang Kellerer. Sensor ranking: A primitive for efficient content-based sensor search. In Intl. Conf. on Information Processing in Sensor Networks, pages 217–228, Washington, DC, USA, 2009. IEEE. [Elsas et al., 2008] Jonathan L. Elsas, Jaime Arguello, Jamie Callan, and Jaime G. Carbonell. Retrieval and feedback mod- els for blog feed search. In Proc. of 31st Intl. ACM SIGIR Conf. on research and development in Information Retrieval, pages 347–354, Singapore, 2008. [Guttman, 1984] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In ACM SIGMOD Conf., pages 47–57, Boston, MA, 1984. ACM. [Guy et al., 2010] Ido Guy, Michal Jacovi, Adam Perer, Inbal Ronen, and Erel Uziel. Same places, same things, same people?: mining user similarity on social media. In CSCW ’10: Proc. of ACM Conf. on Computer Supported Cooperative Work, pages 41–50, Savannah, Georgia, USA, 2010. [Hariharan et al., 2008] Ramaswamy Hariharan, Bijit Hore, and Sharad Mehrotra. Discovering GIS sources on the web using summaries. In Proc. of 8th ACM/IEEE joint Conf. on digital libraries, pages 94–103, Pittsburgh, PA, USA, 2008. ACM. [Ilyas et al., 2008] Ihab F. Ilyas, George Beskales, and Mo- hamed A. Soliman. A survey of top-k query processing tech- niques in relational database systems. ACM Comput. Surv., 40(4):1–58, 2008. [Ioannidis, 2003] Yannis Ioannidis. The history of histograms (abridged). In Proc. of 29th Intl. Conf. on Very Large Data Bases, pages 19–30, Berlin, Germany, 2003. [Kim and Chung, 2003] Deok-Hwan Kim and Chin-Wan Chung. Collection fusion using Bayesian estimation of a linear regres- sion model in image databases on the web. Inf. Processing and Management, 39:267–285, 2003. [Kim et al., 2002] Deok-Hwan Kim, Seok-Lyong Lee, and Chin- Wan Chung. Heterogeneous image database selection on the web. The Journal of Systems and Software, 64:131–149, 2002. [Lalmas, 2009] Mounia Lalmas. XML retrieval. Synthesis Lec- tures on Information Concepts, Retrieval and Services. Morgan & Claypool Publishers, 2009. [Lee et al., 1999] Ju-Hong Lee, Deok-Hwan Kim, and Chin-Wan Chung. Multi-dimensional selectivity estimation using com- pressed histogram information. SIGMOD Record, 28(2):205– 214, 1999. [Lupu et al., 2007] Mihai Lupu, Jianzhong Li, Beng Chin Ooi, and Shengfei Shi. Clustering wavelets to speed-up data dis- semination in structured P2P manets. In Intl. Conf. on Data Engineering, pages 386–395, Istanbul, Turkey, 2007. IEEE. [Müller et al., 2005a] W. Müller, M. Eisenhardt, and A. Henrich. Fast retrieval of high-dimensional feature vectors in P2P net- works using compact peer data summaries. Multimedia Sys- tems, 10(6):464–474, 2005. [Müller et al., 2005b] Wolfgang Müller, Martin Eisenhardt, and Andreas Henrich. Scalable summary based retrieval in P2P networks. In Proc. of the 14th ACM Conf. on Information and Knowledge Management, pages 586–593, Bremen, Germany, 2005. [Novak et al., 2008] David Novak, Michal Batko, and Pavel Zezula. 
Web-scale system for image similarity search: When the dreams are coming true. In Intl. Workshop on Content- Based Multimedia Indexing, pages 446–453, London, UK, 2008. IEEE. [Papapetrou et al., 2007] Odysseas Papapetrou, Wolf Siberski, Wolf-Tilo Balke, and Wolfgang Nejdl. DHTs over peer clus- ters for distributed information retrieval. In 21st Intl. Conf. on Advanced Information Networking and Applications, pages 84–93, Niagara Falls, Canada, 2007. IEEE. [Ratnasamy et al., 2001] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Schenker. A scalable content-addressable network. In ACM SIGCOMM, pages 161– 172, San Diego, CA, USA, 2001. [Samet, 2006] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006. [Saroiu et al., 2002] Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble. A measurement study of peer-to-peer file sharing systems. In ACM/SPIE Multimedia Computing and Networking, pages 156–170, San Jose, CA, USA, 2002. [Stoica et al., 2001] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In SIG- COMM’01, pages 149–160, San Diego, CA, USA, 2001. [Thomas and Hawking, 2009] Paul Thomas and D. Hawking. Server selection methods in personal metasearch: a compar- ative empirical study. Information Retrieval, 12(5):581–604, 2009. [Vu et al., 2009] Quang Hieu Vu, Mihai Lupu, and Sai Wu. Simpson: Efficient similarity search in metric spaces over P2P structured overlay networks. In Proc. of 15th Intl. Euro- Par Conf. on Parallel Processing, pages 498–510, Delft, The Netherlands, 2009. Springer. 190 Benutzerorientiertes Dokumenten-Clustering durch die Verwendung einer Anfragemenge Marc Lechtenfeld Informationssysteme Universita¨t Duisburg-Essen Abstract Der Einsatz einer Anfragemenge ermo¨glicht die Beru¨cksichtigung des Informationsbedu¨rfnisses des Benutzers beim Strukturierungsprozess von Dokumenten. Der vorgestellte benutzerorientier- te Ansatz kann den Benutzer damit bei mehreren Suchaktivita¨ten unterstu¨tzen. Geplante Benutzer- experimente sollen die Effektivita¨t, Effizienz und Benutzerzufriedenheit pru¨fen. 1 Einfu¨hrung Ein Benutzer kann mithilfe zahlreicher Information-Re- trieval-Verfahren nach wohlspezifizierten Informationen suchen, indem er sein Informationsbedu¨rfnis in Form einer Anfrage an sie richtet. Er ist jedoch nicht immer in der Lage sein Informationsbedu¨rfnis genau zu spezifizieren und in einer speziellen Systemanfrage auszudru¨cken. Um seine Aufgabe in diesem Fall trotzdem erfu¨llen zu ko¨nnen, ist er dann auf die Unterstu¨tzung weiterer Suchaktivita¨ten durch das System angewiesen [Belkin, 1993]. Explorative Suchaktivita¨ten Das Durchsto¨bern einer von einem Clustering-Verfahren strukturierten Dokumentmenge stellt fu¨r den Benutzer eine Mo¨glichkeit dar, nach relevanten Dokumenten zu suchen, ohne dabei sein Informationsbedu¨rfnis explizit ausdru¨cken zu mu¨ssen. Diese Dokumentmenge ko¨nnte beispielsweise eine große Kollektion von Dokumenten sein, die vom Be- nutzer u¨ber die in ihr enthaltenen Themen durchsucht wer- den kann oder ein Suchergebnis zu einer mehrdeutigen An- frage, das durch die Gruppierung der Dokumente mit glei- cher Bedeutung fu¨r den Benutzer versta¨ndlicher wird. Benutzerorientierung Klassische Dokumenten-Clustering-Verfahren strukturier- en Dokumentmengen allerdings ha¨ufig nur nach thema- tischen Gesichtspunkten. 
Das konkrete Informationsbe- du¨rfnis des Benutzers wird nicht direkt in den Strukturier- ungsprozess einbezogen. Ein Benutzer kann sich jedoch fu¨r unterschiedliche Aspekte eines Dokuments interessieren, beispielsweise fu¨r die Textsorte, die Versta¨ndlichkeit oder die inhaltliche Qualita¨t eines Dokuments. So mo¨chte er fu¨r eine Sammlung von Zeitungsartikeln vielleicht heraus- finden, u¨ber welche Ereignisse in den unterschiedlichen Ressorts negativ berichtet wurde. Dies erfordert beispiels- weise sowohl eine thematische Gruppierung der Doku- mente nach Ressorts (Politik, Sport, usw.) als auch eine Gruppierung nach Sentiment [Pang et al., 2002] (positive, neutrale und negative A¨ußerungen). 2 Benutzerorientiertes Dokumenten-Clustering Um ein fu¨r den Benutzer nu¨tzliches Clustering zu erzeugen, sollte sein Informationsbedu¨rfnis den Strukturierungspro- zess auf eine grundlegende Art und Weise beeinflussen. 2.1 Optimum Clustering Framework Die theoretische Basis dazu bildet das Optimum Cluster- ing Framework (OCF) [Fuhr et al., 2010]. Zur Bestimmung der Dokumenta¨hnlichkeit fu¨hrt es eine Anfragemenge ein, die in Verbindung mit einem Retrievalmodell dazu verwen- det wird, die Relevanz eines Dokuments fu¨r den Benut- zer zu scha¨tzen. Die A¨hnlichkeit zweier Dokumente ergibt sich dann aus dem Skalarprodukt der zwei Dokumentrepra¨- sentationen. Diese beiden Vektoren bestehen aus den Re- levanzwahrscheinlichkeiten bezu¨glich aller Anfragen der Anfragemenge. Dokumente werden demnach als a¨hnlich definiert, wenn sie bezu¨glich mo¨glichst vieler Anfragen gleichzeitig potenziell relevant sind. Klassische Dokumenten-Clustering-Verfahren verwen- den dabei eine Anfragemenge, die aus allen in der Doku- mentmenge vorkommenden Termen besteht. Experimen- telle Ergebnisse zeigen jedoch, das Clusterings besser wer- den, wenn die verwendeten Anfragen sta¨rker den tatsa¨ch- lichen Anfragen entsprechen, die der Benutzer fu¨r eine an- frageorientierte Suche verwenden wu¨rde. 2.2 Beru¨cksichtigung einer Anfragemenge Um eine Strukturierung der Dokumentmenge zu erzeugen, die sich sta¨rker am Informationsbedu¨rfnis des Benutzers orientiert und damit nu¨tzlicher fu¨r ihn ist, kann daher das Optimum Clustering Framework auf eine Anfragemenge angewandt werden, die an das Informationsbedu¨rfnis des Benutzers sta¨rker angepasst ist. Die Anfragemenge sollte dabei so gestaltet sein, dass sie die Anfragen, die das Informationsbedu¨rfnis des Benutzers am besten ausdru¨cken, entha¨lt, aber nicht wesentlich gro¨ßer ist. Beim Clustering eines Suchergebnisses kann diese An- fragemenge beispielsweise auch alle potenziellen Anfrage- erweiterungen der Suchanfrage enthalten. 2.3 Bestimmung der Anfragemenge Liegen u¨ber das Informationsbedu¨rfnis des Benutzers kei- nerlei Informationen vor, so erscheint es sinnvoll als An- fragemenge die Menge aller in der Dokumentmenge vor- kommenden Terme als Eintermanfragen zu verwenden. Dies entspricht dann dem klassischen Clustering nach dem dominantesten Aspekt, der in der Regel das Thema ist. Da eine spezialisierte Anfragemenge jedoch zu besser- en Strukturierungen fu¨hrt, sollte, wenn weitere Hinweise 191 u¨ber das konkrete Informationsbedu¨rfnis des Benutzers verfu¨gbar sind, eine Anfragemenge verwendet werden, die an diese Informationen angepasst ist. Anfragemengen fu¨r bestimmte Aspekte Durch die Interaktion mit dem Benutzer ko¨nnen besser an sein Informationsbedu¨rfnis angepasste Anfragemengen ge- funden werden. 
Eine Mo¨glichkeit dazu besteht darin, den Benutzer explizit nach den Aspekten zu fragen, fu¨r die er sich interessiert. Beispielsweise durch Vorlage einer Liste von Aspekten, die allgemein fu¨r viele Benutzer interessant sind und die in der vorliegenden Dokumentmenge vorkom- men. So ko¨nnte eine Kollektion von Bu¨chern z. B. nach den Themengebieten, nach den Leserbewertungen oder nach der Versta¨ndlichkeit gruppiert werden. Zur Strukturierung wird dann jeweils die vorbereitete Anfragemenge einge- setzt, die den Aspekt am besten beschreibt, den der Be- nutzer ausgewa¨hlt hat. Durch die Mo¨glichkeit zur Auswahl verschiedener Aspekte wird so ein mehrdimensionales Clustering aufgrund unterschiedlicher Aspekte mo¨glich. Individuelle Anfragemengen Das System ko¨nnte wa¨hrend der Interaktion mit dem Be- nutzer jedoch auch Informationen u¨ber den Benutzer indi- rekt sammeln. Diese ko¨nnten dabei helfen die anfa¨nglich große Anfragemenge, die ein allgemeines Informations- bedu¨rfnis ausdru¨ckt, sukzessive zu spezialisieren und so an das individuelle Informationsbedu¨rfnis des Benutzers anzupassen. Auf diese Weise wa¨re es vielleicht mo¨glich, automatisch zu bestimmen, fu¨r welche Aspekte sich der Benutzer wahrscheinlich interessiert. Mo¨gliche Quellen fu¨r die Eingrenzung der potenziellen Anfragen sind neben dem Interaktionsverhalten des Be- nutzers mit dem System beispielsweise auch Charakteris- tika der Kollektion, Adaptierungen der Benutzerschnittstel- le, die verwendeten Retrievalmodelle oder externe Quellen wie z. B. die Enzyklopa¨die Wikipedia. 2.4 Unterstu¨tzte Suchaktivita¨ten Die Kombination von Verfahren des Information-Retrieval und Dokumenten-Clustering ermo¨glicht die Unterstu¨tzung verschiedener Suchaktivita¨ten. Interpretation einer Dokumentmenge Einen ersten U¨berblick u¨ber eine unbekannte Kollektion oder u¨ber ein Suchergebnis kann sich der Benutzer mithilfe einer Strukturierung dieser Dokumentmengen verschaffen. So ko¨nnte man durch eine Gruppierung der Ergebnisse ein- er Buchsuche beispielsweise einen U¨berblick u¨ber aktuelle Programmiersprachen erhalten. Die Mo¨glichkeit die Strukturierung aufgrund unter- schiedlicher Aspekte durchfu¨hren zu lassen, kann dem Benutzer dabei helfen die Beschaffenheit der Dokument- menge oder sein Informationsbedu¨rfnis besser zu verste- hen. Die Bu¨cher u¨ber Programmiersprachen lassen sich beispielsweise auch nach den Vorkenntnissen z. B. in Bu¨- cher fu¨r Anfa¨nger und Fortgeschrittene oder nach der Le- serzufriedenheit u¨ber die Rezensionen gruppieren. U¨ber die jeder Gruppierung zugrundeliegende Anfrage- menge ist fu¨r den Benutzer mo¨glicherweise erkennbar, fu¨r welche Anfragen oder Aspekte die Dokumente der gerade betrachteten Gruppierung relevante Informationen liefern und wie sich die Dokumente der unterschiedlichen Cluster voneinander unterscheiden. Auf diese Weise kann er ler- nen, u¨ber welche Anfragen die gewu¨nschten Dokumente vom Retrievalsystem zuru¨ckgeliefert werden bzw. wie er eine Anfrage mit weiteren Termen erga¨nzen ko¨nnte, um die gewu¨nschten Dokumente zu erhalten. Beim Clustering eines Suchergebnisses kann man die Fragemenge beispiels- weise als Menge von mo¨glichen Frageerweiterungen be- trachten. Finden durch Erkennen Durch das Durchsto¨bern der gruppierten Dokumentmenge kann der Benutzer Dokumente finden, ohne sein Informa- tionsbedu¨rfnis explizit ausdru¨cken zu mu¨ssen. 
Er kann re- levante Dokumente entdecken, indem er die Dokumente – nach der Terminologie von Belkin [Cool and Belkin, 2002] – scannt und die relevanten Dokumente in der Dokument- menge als fu¨r ihn relevant erkennt. So ko¨nnte der Benutzer bei einer Suche nach einem Einsteigerbuch fu¨r eine neue Programmiersprache z. B. auf eine Untergruppe von Um- steigerbu¨chern stoßen, die fu¨r ihn nu¨tzlicher sind. Durch die Interaktion mit dem System etwa durch die Auswahl einzelner Cluster oder durch das manuelle Grup- pieren einzelner Dokumente, kann das System die Frage- menge automatisch Schritt fu¨r Schritt an das Informa- tionsbedu¨rfnis des Benutzers anpassen. Auf diese Weise kann ein auf den Benutzer zugeschnittenes Clustering der Kollektion oder der Ergebnismenge erzeugt werden. 3 Benutzerorientierte Evaluation Es sind Benutzerexperimente geplant, die die Nu¨tzlichkeit dieses Ansatzes fu¨r den Benutzer bestimmen sollen. Es sollen dabei insbesondere auch Informationsbedu¨rf- nisse untersucht werden, die sich nur schwer in Form ein- er spezifischen Anfrage ausdru¨cken lassen. Dabei ist zu beru¨cksichtigen, dass sich der Benutzer fu¨r verschiedene Aspekte interessieren kann, die zu unterschiedlichen Struk- turierungen fu¨hren ko¨nnen. Ein Vergleich der erzeugten Strukturierung mit einer manuell erstellten Klassifikation, die nur eine Sichtweise abbildet, reicht daher nicht. Mithilfe von Simulated Work Tasks [Borlund, 2000] soll gepru¨ft werden, ob der Benutzer die ihm gestellten Auf- gaben lo¨sen kann (Effektivita¨t), wie viel Zeit er dafu¨r be- no¨tigt (Effizienz) und wie zufrieden er ist (Zufriedenheit), also ob er damit insgesamt bei der Suche profitiert. Literatur [Belkin, 1993] Nicholas J. Belkin. Interaction with texts: Information retrieval as information-seeking behavior. In Information Retrieval, pages 55–66, 1993. [Borlund, 2000] Pia Borlund. Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56:71–90, 2000. [Cool and Belkin, 2002] Colleen Cool and Nicholas J. Belkin. A classification of interactions with information. In H. Bruce, R. Fidel, P. Ingwersen, and P. Vakkari, edi- tors, Emerging frameworks and methods. Proceedings of the 4th COLIS, pages 1–15, Greenwood Village, 2002. Libraries Unlimited. [Fuhr et al., 2010] Norbert Fuhr, Marc Lechtenfeld, Ben- no Stein, and Tim Gollub. The Optimum Cluster- ing Framework: Implementing the Cluster Hypothesis. 2010. Submitted. [Pang et al., 2002] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification us- ing machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 79–86, 2002. 192 Semiautomatische Konstruktion von Trainingsdaten fu¨r historische Dokumente Andrea Ernst-Gerlach, Norbert Fuhr Abteilung Informatik und Angewandte Kognitionswissenschaft Fakulta¨t fu¨r Ingenieurwissenschaften Universita¨t Duisburg-Essen 47048 Duisburg, Deutschland Abstract Fu¨r Retrieval in historischen Dokumenten wird eine Abbildung der Suchbegriffe auf die histo- rischen Varianten in den Dokumenten beno¨tigt. Fu¨r diese Abbildung wurde ein regelbasierter Ansatz entwickelt. Der Engpass dieses Ansatzes ist die Konstruktion der Trainingsdaten. Dabei muss ein Experte manuell den historischen For- men, die dem Spellchecker unbekannt sind, die aktuelle moderne Form zuordnen. Zur Verbes- serung dieses Verfahrens werden nun die Vor- schla¨ge des Spellcheckers betrachtet. 
Aus je- dem Vorschlag und dem zugeho¨rigen unbekann- ten Wort wird ein Beleg gebildet. Aus diesen Belegen werden nun wie gewohnt Regeln gene- riert und die ha¨ufigsten Regeln akzeptiert. Ex- perimentelle Ergebnisse basierend auf der bis- herigen Belegkollektion zeigen, dass ein großer Teil der Regeln auf diese Weise generiert wer- den kann. Dadurch ko¨nnen die Trainingsdaten deutlich schneller und mit geringerem manuel- lem Aufwand erzeugt werden. 1 Einleitung Die Anzahl der digitalen historischen Kollektionen steigt kontinuierlich. Aber bei verfu¨gbarer Volltextsuche fu¨r die Kollektionen werden viele Dokumente nicht gefunden, weil sie eine nicht-standardisierte Rechtschreibung ver- wenden. In vielen La¨ndern war die Schreibweise u¨ber vie- le Jahrhunderte hinweg nicht festgelegt. So wurde z. B. die deutsche Sprache erst 1901/1902 standardisiert [Kel- ler, 1986]. Davor galt das Prinzip “Schreibe wie Du sprichst” (phonologisches Prinzip der Rechtschreibung). Z. B. ist akzeptieren die moderne Form der Schreibvari- ante acceptieren. Die nicht-standardisierte Schreibweise fu¨hrt zu Fehlern, wenn in historischen Teilen von digitalen Bibliotheken ge- sucht wird. Die meisten Benutzer geben den Suchbegriff in moderner Sprache ein, die sich von der historischen Spra- che in den Dokumenten unterscheidet. Auch popula¨re Digitalisierungsinitiativen wie Google Book Search1 oder die europa¨ische digitale Bibliothek2 un- terstu¨tzen bisher keine Suche nach Schreibvarianten. Um dieses Problem zu lo¨sen, arbeitet unser Projekt an der Ent- wicklung einer Suchmaschine, bei der der Benutzer seine Anfragen in aktueller Schreibweise eingeben kann, wenn er 1http://books.google.com/ g. a. 27.08.2010 2http://www.europeana.eu/portal/ g. a. 27.08.2010 in historischen Dokumenten suchen mo¨chte (siehe [Ernst- Gerlach and Fuhr, 2007]). Andere Ansa¨tze benutzen dafu¨r wo¨rterbuchbasierte Me- thoden (z. B. [Hauser et al., 2007]). Allerdings hat die- se Vorgehensweise den entscheidenden Nachteil, dass nur Wo¨rter gefunden werden, die im Wo¨rterbuch enthalten sind. Außerdem ist der zeitliche Aufwand fu¨r den manu- ellen Aufbau der Wo¨rterbu¨cher relativ hoch. Die entwickelte Suchmaschine u¨berwindet diesen Nach- teil mit einem regelbasierten Ansatz, um das gesamte Vo- kabular abzudecken und dadurch den Recall zu erho¨hen. Dafu¨r werden Transformationsregeln entwickelt, die aus einem Suchbegriff die historischen Varianten generieren. Durch die Orts- und Zeitabha¨ngigkeit der Regeln mu¨ssen die Regelsa¨tze jeweils neu generiert werden, wenn ein neu- es Korpus verfu¨gbar wird. Diese Arbeit muss ohne Hilfe von Informatikern z. B. von Linguisten und Historikern ge- leistet werden. Deswegen ist es notwendig, ein Werkzeug zu entwickeln, das den Benutzer auf leicht versta¨ndliche und schnelle Art bei der Regelgenerierung unterstu¨tzt und dessen Benutzung keine Informatikkenntnisse voraussetzt. Im Folgenden gehen wir davon aus, dass der Benutzer fu¨r eine neue Kollektion eine Volltextsuche ermo¨glichen mo¨chte. Weil fu¨r Ort und Zeit der Kollektion kein Regelsatz vorhanden ist, muss der Benutzer zuna¨chst Belege sam- meln. Ein Beleg besteht aus der Flexionsform des Lemmas (im Folgenden Wortform genannt) und der zugeho¨rigen historischen Variante. Erst im zweiten Schritt ko¨nnen die Regeln generiert werden. Um die verschiedenen Benutzerinteressen darzustellen, wurden zwei Szenarios von Benutzertypen entworfen. 
Die Benutzertypen sollen dabei in erster Linie die Spannwei- te des notwendigen Bedarfs an Unterstu¨tzung bei der Er- stellung von Belegen und Regeln darstellen. Z. B. ko¨nnte fu¨r einen Linguisten bereits die Bildung der Belege ein interessantes Forschungsthema sein. Er mo¨chte die Bele- ge nur mit semi-automatischer Unterstu¨tzung generieren, weil er auch an der Entwicklung der Sprache sowie an Re- geln mit einer sehr hohen Precision interessiert ist. Er sucht oft nach allen Vorkommen eines Wortes in einer Kollekti- on und kann deswegen nur mit einem kompletten Regel- satz arbeiten. Im Gegensatz dazu mo¨chte ein Historiker le- diglich relevante Dokumente zu einem bestimmten Thema finden. Deswegen mo¨chte er mo¨glichst schnell eine Voll- textsuche nutzen. Demzufolge wird er einen automatischen Ansatz bevorzugen. Auch wenn er deswegen zuna¨chst ei- nige Dokumente nicht findet, reicht es ihm aus, falls er die Mo¨glichkeit hat, den Regelsatz spa¨ter zu u¨berarbeiten. Ausgehend von seinen Bedu¨rfnissen wird sich der Be- nutzer mehr auf den Recall oder die Precision seiner Su- che konzentrieren. Das Tool soll an dieser Stelle die not- 193 wendige Flexibilita¨t bieten. Dabei wird dem Benutzer Un- terstu¨tzung fu¨r den gesamten Prozess der Regelgenerierung angeboten. Allerdings bleibt es ihm selbst u¨berlassen, wie viele der vorgeschlagenen Belege (und Regeln) er akzep- tiert. Der vorliegende Artikel hat die folgende Struktur: Zuna¨chst wird ein kurzer U¨berblick u¨ber die verwandten Arbeiten im Bereich der Erstellung von Trainingsdaten ge- geben. Anschließend wird in Abschnitt 3 die Regelgenerie- rung erla¨utert. Abschnitt 4 zeigt, wie der Algorithmus zur Regelgenerierung auch zur automatischen Generierung von Belegen und Regeln verwendet werden kann. Der Ansatz wird in Abschnitt 5 evaluiert. Der letzte Abschnitt fasst den Artikel zusammen und gibt einen Ausblick auf zuku¨nftige Arbeiten. 2 Verwandte Arbeiten Gotscharek et. al. [Gotscharek et al., 2009] haben mit dem LeXtractor ein Werkzeug zur Konstruktion von his- torischen Lexika entwickelt. Die Lexikoneintra¨ge ko¨nnen auch als Belege fu¨r unseren Ansatz aufgefasst werden. Das Werkzeug hat zwei Ansichten. Die erste bietet dem Be- nutzer einen Text mit hervorgehobenen unbekannten Ter- men. In dieser Ansicht kann der Benutzer auf Basis des Textes Lexikoneintra¨ge erzeugen. Die zweite Ansicht zeigt eine Liste mit unbekannten Termen, die nach absteigender Termha¨ufigkeit geordnet ist. Durch die Arbeit mit dieser Liste kann der Prozentsatz der vom Lexikon abgedeckten Wo¨rter schnell gesteigert werden. Da das Werkzeug zur Lexikonerstellung verwendet wird, mu¨ssen die Ergebnisse eine hohe Pra¨zision aufweisen. Des- wegen muss ein Experte alle unbekannten Wo¨rter der gan- zen Kollektion bearbeiten und jede Schreibweise beurtei- len. Zur Unterstu¨tzung wird eine Liste mit Textstellen an- geboten, wenn ein Wort fu¨r die Konstruktion eines Le- xikoneintrags ausgewa¨hlt wird. Der LeXtractor verwen- det manuell erstellte Regeln (sog. Patterns), um potenzi- elle moderne Formen in einem aktuellen Lexikon zu fin- den. Wa¨hrend der Konstruktion von Lexikoneintra¨gen ist es auch mo¨glich, zusa¨tzliche Patterns vorzuschlagen, wenn eine neue Regel bemerkt wird. Pilz und Luther [Pilz and Luther, 2009] haben eine Me- thode entwickelt, um die Sammlung von Belegen in ih- rem Evidencer Werkzeug zu unterstu¨tzen. Die Methode soll vor allem den Arbeitsaufwand fu¨r diesen Schritt ver- ringern. Der Evidencer benutzt dazu einen Bayes’scher Klassifizierer. 
Dabei nehmen sie an, dass sich die Vertei- lung der N-gramme signifikant zwischen den Standard- und den Nicht-Standardschreibungen unterscheidet. Zur Einteilung in richtige Schreibweisen und Schreibvarianten scha¨tzt der Klassifizierer die Wahrscheinlichkeit, ob es sich um eine Schreibvariante handelt. Um diese Wahrschein- lichkeit abzuscha¨tzen, werden Trainingsbeispiele beno¨tigt. Nach der Trainingsphase wird eine Liste mit unbekann- ten Wo¨rtern pra¨sentiert. Diese sortiert die Wo¨rter abstei- gend nach der Wahrscheinlichkeit fu¨r Schreibvarianten. Der Benutzer kann den Klassifizierer anpassen, indem er den Grenzwert fu¨r mo¨gliche Varianten vera¨ndert. VARD 2 [Baron and Rayson, 2008] ist ebenfalls in der Lage, moderne Formen fu¨r Schreibvarianten in histori- schen Dokumenten zu finden. Das Werkzeug markiert alle potenziellen Varianten, die nicht in einem modernen Lexi- kon zu finden sind. Fu¨r jedes markierte Wort wird dem Be- nutzer eine Liste mit potenziellen zugeho¨rigen modernen Schreibungen angeboten. Der Benutzer kann aus der Liste dann die passende moderne Form auswa¨hlen. Ein zweiter Modus bietet zudem die Mo¨glichkeit, automatisch die Vor- schla¨ge mit dem ho¨chsten Ranking zu akzeptieren, wenn der Wert u¨ber einem vom Benutzer festgelegten Mindest- wert liegt. Um diese Vorschla¨ge zu generieren, werden die folgenden drei Methoden benutzt: • manuelle Liste von Beispielen fu¨r moderne Wo¨rter und die zugeho¨rigen Schreibvarianten, • modifizierte Version des SoundEx-Algorithmus, • manuell erstellte Liste von Ersetzungsregeln. Basierend auf diesen drei Methoden wird ein Konfidenz- wert fu¨r einen Vorschlag berechnet. Der Konfidenzwert ist dabei kein fester Wert, sondern wird nach jedem Schritt au- tomatisch angepasst. Der erste Ansatz beno¨tigt (aufgrund seines Einsatzge- bietes) ein großes Maß an manueller Interaktion sowohl fu¨r die Bildung der Belege als auch fu¨r die Bildung der Regeln. Der zweite Ansatz sieht mit Blick auf die auto- matische Unterstu¨tzung fu¨r den Benutzer vielversprechen- der aus. Der Benutzer hat hier die Mo¨glichkeit, die Qua- lita¨t der Ergebnisse durch einen Schwellwert zu beeinflus- sen. Allerdings beno¨tigt der Klassifizierer eine große An- zahl an Trainingsdaten. Somit ist einiges an manueller Ar- beit notwendig bevor der Klassifizierer eingesetzt werden kann. Außerdem kann der Benutzer nur dokumentweise vorgehen. Somit kann er nicht mehrere Vorkommen von mo¨glichen Varianten in verschiedenen Texten gleichzei- tig betrachten. Der dritte Ansatz sieht besonders wegen des sta¨ndig angepassten Konfidenzwertes fu¨r die moder- nen Formen sehr vielversprechend aus. Der Konfidenzwert ist ansonsten vergleichbar mit dem Bayes’scher Klassifi- zierer aus dem zweiten Ansatz. Der Nachteil von VARD 2 besteht in der Notwendigkeit von Trainingsdaten und Re- geln als Eingabe. In beiden Fa¨llen handelt es sich um ma- nuell gesammelte Daten. Des Weiteren ist der SoundEx- Algorithmus ein phonetischer Algorithmus, der fu¨r die eng- lische Sprache entwickelt wurde. Fu¨r die deutsche Sprache gibt es mit der Ko¨lner Phonetik [Postel, 1969] ebenfalls einen phonetischen Algorithmus. Allerdings sind phoneti- sche Algorithmen nur bedingt fu¨r die Erstellung von Re- geln fu¨r historische Schreibweisen geeignet, weil sie durch die zeitliche und ra¨umliche Vera¨nderung der Aussprache ebenfalls angepasst werden mu¨ssten. Zusammenfassend la¨sst sich feststellen, dass keiner der betrachteten Ansa¨tze Belege automatisch generieren kann. 
In summary, none of the approaches considered can generate evidence pairs automatically. All tools therefore require a considerable manual effort before they are ready for use. A tool that can automatically create evidence pairs for training sets would thus make historical documents substantially more accessible to the user.

3 Generating Transformation Rules

In the following we briefly describe our existing methods for collecting evidence pairs and for rule generation [Ernst-Gerlach and Fuhr, 2006]. First, training data are needed that map modern word forms to the corresponding historical variants. With the help of a spell checker we obtain a list of potential historical spellings. For spell checking we use Hunspell (http://hunspell.sourceforge.net/, accessed 27.08.2010), which currently provides dictionaries for 98 languages. Its suggestions for misspelled words are based on n-gram comparisons, rules and pronunciation data; with these methods, suggestions are generated on the basis of a dictionary. A manual check determines whether the unknown words really are spelling variants. If so, the corresponding modern forms are determined. In addition, we need the term frequencies of the historical forms. We can then concentrate on the next step, the rule generation.

The automatic rule generation starts from the created training data. Here we use a triple consisting of the modern word form, the corresponding spelling variant and the collection frequency of the spelling variant. First, the two words of each pair are compared and so-called "rule cores" are determined. These contain the necessary transformations and identify the associated context. For the modern word form unnütz and the historical form unnuts, for example, this yields the following set of two rule cores: (unn(ü→u)t), (t(z→s)). In the second step, the corresponding rule candidates are determined for every rule core. These also take into account the context information of the modern spelling (e.g. consonant (C) or end of word ($)). For the example above, the generated rule candidates include: ü→u, nü→nu, üt→ut, nüt→nut, Cü→Cu, z$→s$. In the last step, the useful rules are selected by pruning the rule set (also taking the collection frequency into account) with a modified version of the PRISM algorithm (see [Cendrowska, 1987]). For this purpose, negative evidence pairs are first created automatically and precision values are computed for the rules. Subsequently, those rules are selected that reach a specified minimum precision and a minimum occurrence frequency.
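The derivation of rule cores and rule candidates described in this section can be illustrated with a short sketch. The consonant/vowel context classes and the exact candidate expansion are assumptions for illustration and do not claim to reproduce the authors' implementation.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiouäöü")

def rule_cores(modern, variant):
    """Minimal edit spans between a modern form and a historical variant.

    Returns tuples (left_context, old, new, right_context) taken from the
    modern spelling, e.g. unnütz/unnuts -> [('unn', 'ü', 'u', 'tz'),
    ('unnüt', 'z', 's', '')].
    """
    cores = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, modern, variant).get_opcodes():
        if tag != "equal":
            cores.append((modern[:i1], modern[i1:i2], variant[j1:j2], modern[i2:]))
    return cores

def rule_candidates(core):
    """Expand one rule core into a few context-aware rule candidates."""
    left, old, new, right = core
    cands = {f"{old}->{new}"}
    if left:                                   # one character of left context
        cands.add(f"{left[-1]}{old}->{left[-1]}{new}")
        cls = "V" if left[-1].lower() in VOWELS else "C"
        cands.add(f"{cls}{old}->{cls}{new}")   # character-class context
    if right:                                  # one character of right context
        cands.add(f"{old}{right[0]}->{new}{right[0]}")
    else:                                      # transformation at the word end
        cands.add(f"{old}$->{new}$")
    return cands

if __name__ == "__main__":
    for core in rule_cores("unnütz", "unnuts"):
        print(core, rule_candidates(core))
```

For unnütz/unnuts the sketch produces the bare substitutions ü→u and z→s together with context-aware candidates such as Cü→Cu and z$→s$, matching the examples given above.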
4 Automatically Accepted Evidence Pairs

The previous section has made clear that the existing approach initially requires a large amount of manual effort. Our main goal is therefore to develop an algorithm that can generate evidence pairs automatically in order to reduce this effort. The assumption that spelling variants exhibit a certain degree of regularity forms the basis of the rule-based approach. Based on this assumption, the evidence pairs are now to be generated automatically as well. The correct modern form of a spelling variant is frequently found among the suggestions of the spell checker. We assume that the regularities between modern form and spelling variant are found considerably less often between spelling variants and wrong suggestions. Our algorithm (see Figure 1) therefore concentrates on the problem of determining the correct spell-checker suggestion for a variant: the suggestion that is supported by the more frequent rule candidates is selected.

From every unknown spelling and its associated suggestions an evidence pair is generated (see Table 1). These evidence pairs form the training set from which possible rule candidates are generated. Since the final rules are not yet generated in this step, the different rule candidates are not relevant here and only the rule cores are considered. In this way we obtain a clearer distribution of the rules. The more frequently a rule core occurs across different evidence pairs, the higher the probability that the rule is meaningful, and the higher the precision of evidence pairs based on frequent rule cores will be. Consequently, in every pass the most frequent of the not yet accepted rule candidates is accepted (see Table 2). If several rule candidates have the same frequency, substitution rules are accepted first, since they usually have a higher precision than insertion and deletion rules; for example, i→y is preferred over s→∅ and ∅→h.

Rule core | Rule frequency | Decision
i→y | 7 | accept
un→∅ | 1 |
un→ge | 1 |
∅→ge | 1 |
w→∅ | 1 |
ster→ck | 1 |
ar→zey | 1 |
l→z | 1 |
i→yt | 1 |
mann→zeyt | 1 |
∅→je | 1 |
ig→∅ | 1 |
Table 2: Example of sorted rule cores

Suggestion | Possible variant | Decision
Geschicklichkeit | Geschicklichkeyt | accept
jederzeit | jederzeyt | accept
obgleich | obgleych | accept
Sonderheit | Insonderheyt | mark i→y as accepted
Table 3: Example of accepting the rule core i→y

After a rule candidate has been accepted, the associated evidence pairs (and thus the spell-checker suggestions) are examined. If an evidence pair is based only on the accepted rule, it is accepted directly. If it is based on several rules, it is accepted once all other rules have been accepted as well; otherwise it is merely marked that this rule candidate has been accepted (see Table 3). Once accepted suggestions are available, wrong suggestions can be sorted out in the next step. For this we assume that only one modern spelling exists for each spelling variant, which allows us to remove the remaining suggestions for historical spellings. This assumption is a simplification: as Pilz [Pilz, 2009] has shown, the spelling variant Hunngern has the two modern forms Ungarn and Hungern. The simplification is necessary at this point to make automatic evidence generation possible at all and thus to reduce the manual effort for constructing the training data considerably. This restriction does not apply to the later manual processing of evidence pairs. Moreover, we assume that rules lost through this assumption will still be generated from other evidence pairs. During the deletion process the wrong evidence pairs are also removed from the other rule cores they belong to. The process then starts again with the most frequent unprocessed rule candidate.
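The acceptance procedure just described, which is also given as pseudocode in Figure 1 below, can be sketched in a few lines. The sketch assumes that rule candidates have already been reduced to rule cores and omits the tie-breaking in favour of substitution rules; it is a simplified reading, not the RuleGenerator code.

```python
from collections import Counter

def accept_evidence(candidates, min_rule_freq=2):
    """Greedy acceptance loop over spell-checker suggestions.

    `candidates` maps a variant to a dict {suggestion: set_of_rule_cores}.
    Returns the accepted (suggestion, variant) evidence pairs.
    """
    accepted_rules, evidence = set(), []
    while True:
        # Count rule cores over all still-competing suggestions.
        freq = Counter(core for sugg_map in candidates.values()
                            for cores in sugg_map.values()
                            for core in cores)
        open_rules = [(r, f) for r, f in freq.items()
                      if r not in accepted_rules and f >= min_rule_freq]
        if not open_rules:
            break
        rule = max(open_rules, key=lambda rf: rf[1])[0]
        accepted_rules.add(rule)
        for variant in list(candidates):
            for suggestion, cores in candidates[variant].items():
                if cores <= accepted_rules:      # all required rules accepted
                    evidence.append((suggestion, variant))
                    del candidates[variant]      # drop competing suggestions
                    break
    return evidence
```

For the data of Table 1, candidates['Geschicklichkeyt'] would map 'Geschicklichkeit' to {'i→y'}, 'Ungeschicklichkeit' to {'un→∅', 'i→y'}, and so on; after accepting i→y, the pair (Geschicklichkeit, Geschicklichkeyt) is accepted and its competing suggestions are removed.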
Build the training set from the spell-checker suggestions
Generate rule candidates (rule cores only)
While rule candidates r_j with rule frequency > min. rule frequency exist:
  Sort the rule candidates by frequency
  Accept the most frequent not yet accepted rule candidate r_i
  Mark r_i as accepted for all associated suggestions s_i
  Accept every s_i for which all rule candidates have been accepted
  If s_i is accepted, delete all competing suggestions s_k

Figure 1: Algorithm for automatic evidence generation

Suggestion | Possible variant | Rule candidates
Geschicklichkeit | Geschicklichkeyt | i→y
Ungeschicklichkeit | Geschicklichkeyt | un→∅, i→y
Unschicklichkeit | Geschicklichkeyt | un→ge, i→y
Schicklichkeit | Geschicklichkeyt | ∅→ge, i→y
Geschwisterlichkeit | Geschicklichkeyt | w→∅, ster→ck, i→y
jederzeit | jederzeyt | i→y
jederart | jederzeyt | ar→zey
jederlei | jederzeyt | l→z, i→yt
jedermann | jederzeyt | mann→zeyt
derzeitig | jederzeyt | ∅→je, i→y, ig→∅
Table 1: Example of training data and generated rules

The user can influence this process through the following parameters (a minimal configuration sketch is given below):
• Minimum word length: For short words, spell checkers usually generate more suggestions than for long words. In addition, the probability of generating a wrong historical form is considerably higher, because short words are more likely to resemble each other than long words. The precision of the automatic evidence pairs can therefore be increased by a minimum word length.
• Minimum number of rule occurrences: The core of our approach is that the rule frequency is an indicator of precision, so the rule frequency is an obvious parameter. As a lower bound, a rule must in any case occur at least twice.
• Maximum number of rule applications per word: The more rule applications are needed to map a modern word to a potential variant, the less likely it is that the word really is a spelling variant. Short words in particular can very easily be mapped to completely different words; for example, three rule applications turn derzeitig into jederzeyt (see Table 1).

Given this choice of parameters, the historian may prefer a shorter minimum word length, a lower number of rule occurrences and a higher maximum number of rule applications in order to achieve a high recall. In this way he can, if he wishes, start searching the collection right away. The linguist, in contrast, may make exactly the opposite choice.

The presented approach has already been integrated into the RuleGenerator. The RuleGenerator is an interactive tool that offers a graphical user interface with which the user can collect evidence pairs (see [Awakian, 2010]) and generate rules (see [Korbar, 2010]).
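As a compact summary of the three parameters listed above, a configuration object could look as follows. The field names and the pre-filter are illustrative assumptions; the default values correspond to the recall-oriented setting used in the evaluation in Section 5.

```python
from dataclasses import dataclass

@dataclass
class EvidenceGenerationConfig:
    """User parameters controlling the automatic evidence generation."""
    min_word_length: int = 5        # ignore very short words
    min_rule_occurrences: int = 2   # a rule core must occur at least twice
    max_rule_applications: int = 2  # max. rules needed to map modern form -> variant

def passes_filters(cfg, variant, rule_cores):
    """Hypothetical pre-filter applied to one candidate evidence pair."""
    return (len(variant) >= cfg.min_word_length
            and len(rule_cores) <= cfg.max_rule_applications)
```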
The results of the automatic generation process are displayed to the user as a list in a graphical user interface (see Figure 2). At present, this list shows triples of modern word form, spelling variant and the associated rules. The user can then accept individual evidence pairs or all of them at once. In addition, the list provides access to the text passages in which the variants occur, so that the user can inspect the words in context if needed. After processing the automatic evidence pairs, the user can build further evidence pairs from the unknown words that have not yet been assigned.

5 Evaluation

Our evidence database was chosen as the test collection. It was built from texts of the Nietzsche reception (http://www2.inf.uni-due.de/Studienprojekte/Nietzsche/pp2001/die_cd/die_cd.htm, accessed 27.08.2010) as well as several smaller collections (Digitales Archiv Hessen-Darmstadt, http://www.digada.de/index.html; Bibliotheca Augustana, FH Augsburg, http://www.hs-augsburg.de/~harsch/augustana.html; documentArchiv.de, http://www.documentarchiv.de; all accessed 27.08.2010). The texts mainly stem from the 16th to the 19th century.

Number of evidence pairs | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000
Precision | 0.28 | 0.25 | 0.25 | 0.24 | 0.22 | 0.19 | 0.18 | 0.18 | 0.17 | 0.17
Recall (rules) | 0.55 | 0.47 | 0.50 | 0.47 | 0.50 | 0.50 | 0.51 | 0.51 | 0.50 | 0.49
Recall (rule occurrences) | 0.91 | 0.90 | 0.92 | 0.92 | 0.94 | 0.95 | 0.95 | 0.95 | 0.96 | 0.96
Table 4: Recall and precision for rule cores, based on the test data of the respective run

Number of evidence pairs | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000
Rules, automatic | 0.14 | 0.24 | 0.31 | 0.36 | 0.39 | 0.41 | 0.43 | 0.46 | 0.47 | 0.49
Rules, manual | 0.26 | 0.51 | 0.63 | 0.77 | 0.80 | 0.82 | 0.84 | 0.90 | 0.94 | 1
Rule occurrences, automatic | 0.10 | 0.20 | 0.30 | 0.39 | 0.48 | 0.58 | 0.68 | 0.77 | 0.86 | 0.96
Rule occurrences, manual | 0.10 | 0.22 | 0.32 | 0.42 | 0.52 | 0.61 | 0.71 | 0.80 | 0.90 | 1
Table 5: Recall for rule cores, based on the complete test data

Figure 2: User interface for automatic evidence pairs

This example is used to determine how much manual effort can be saved when creating a training collection. To simulate the incremental construction of the collection, the number of evidence pairs used as test data was increased from 1000 in steps of 1000 up to 10000. Our method was applied to every test set. As parameter setting, a recall-oriented choice was made, with a minimum word length of five, at least two rule occurrences and at most two rule applications per word; this is reflected in particular by the low value for the minimum occurrence frequency of the rules.

First, recall and precision values were computed for the rule cores (see Table 4). The precision drops from 0.28 at 1000 evidence pairs to 0.17 at 10000 evidence pairs. This can be explained by the clear choice of parameters in favour of recall: since a rule only had to occur twice to be accepted, the probability that a wrong rule core occurs in a further potential evidence pair grows with the amount of test data. For the recall, in contrast, no effect of the growing amount of training data can be observed; it ranges from 0.47 to 0.55. Roughly half of the rule cores can thus be generated automatically.
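For reference, the recall and precision figures of Tables 4 and 5 can be computed along the following lines. This is a sketch of our reading of the measures; the occurrence-weighted variant corresponds to the recall based on rule occurrences.

```python
def rule_core_metrics(generated, reference, reference_freq):
    """Precision/recall of generated rule cores against a reference set.

    `reference_freq` maps each reference rule core to its occurrence count;
    the weighted recall corresponds to the 'rule occurrences' rows.
    """
    generated, reference = set(generated), set(reference)
    hits = generated & reference
    precision = len(hits) / len(generated) if generated else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    total = sum(reference_freq.values())
    weighted_recall = sum(reference_freq[r] for r in hits) / total if total else 0.0
    return precision, recall, weighted_recall
```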
Since the individual rules differ considerably in how often they are applied, the recall was additionally computed based on the occurrence frequency of the individual rules. Here it becomes evident that above all the particularly frequent rules are generated, since at least 90% of the rule cores are always found.

To assess the development of rule coverage, the development of the recall values of the rule cores with respect to the complete set of evidence pairs was also computed (see Table 5). The recall of the rule cores obtained when building the test collection manually served as the basis of comparison for our method. The user has to inspect 2000 evidence pairs manually to reach the same recall as with the automatic method. Looking at the manual construction of the test collection with respect to rule frequency, the user even has to create more than 9000 evidence pairs manually to raise the recall as far as he could with the automatically generated evidence pairs.

Number of evidence pairs | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000
Automatic evidence coverage | 0.88 | 0.91 | 0.93 | 0.93 | 0.94 | 0.94 | 0.94 | 0.95 | 0.95 | 0.95
Manual evidence coverage | 0.91 | 0.95 | 0.96 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 1 | 1
Table 6: Recall of evidence coverage

To determine how many evidence pairs could be found again with the generated rule cores, the recall of the evidence coverage was computed as well (see Table 6), both for the automatic and for the manual procedure. It turns out that in both cases almost 90% of the evidence pairs are already covered by the rules generated from 1000 evidence pairs, which once more demonstrates the regularity of spelling variants. The user has to assess roughly 2000 evidence pairs manually to achieve the same recall as with the automatic approach.

6 Conclusion

In this article an approach for the automatic construction of evidence pairs was presented. The evidence pairs are needed as input for the rule-generation process, which enables retrieval in texts with non-standardized spelling. The presented approach lets the user adjust several parameters. This makes the approach very flexible, because the user can influence the evidence-generation process according to his expectations regarding recall and precision.

The evaluation has clearly shown that the frequent rule cores are selected. If the user accepts them directly, he only has to look at the words for which no suggestion could be assigned automatically. In the further course of the project we will work on sorting the list of remaining unknown words by decreasing irregularity. Since the automatic evidence pairs are already sorted by decreasing frequency, this list will shortly be sorted on the basis of the rules. This gives the user a quicker overview of possible rules. In addition to adjusted parameter settings, this also allows the relatively low precision to be increased, because the user can process the automatic evidence pairs more quickly.

References

[Awakian, 2010] A. Awakian. Development of a user-interface for an interactive rule development. Master's thesis, University of Duisburg-Essen, 2010.
[Baron and Rayson, 2008] A. Baron and P. Rayson. VARD 2: A tool for dealing with spelling variation in historical corpora.
In Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, 2008.
[Cendrowska, 1987] J. Cendrowska. PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4):349–370, 1987.
[Ernst-Gerlach and Fuhr, 2006] Andrea Ernst-Gerlach and Norbert Fuhr. Generating search term variants for text collections with historic spellings. Volume 3936 of Lecture Notes in Computer Science, Springer, Heidelberg, 2006. ISBN 3540333479.
[Ernst-Gerlach and Fuhr, 2007] Andrea Ernst-Gerlach and Norbert Fuhr. Retrieval in text collections with historic spelling using linguistic and spelling variants. In JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 333–341, New York, NY, USA, 2007. ACM.
[Gotscharek et al., 2009] A. Gotscharek, A. Neumann, U. Reffle, Ch. Ringlstetter, and K. U. Schulz. Enabling information retrieval on historical document collections: the role of matching procedures and special lexica. In AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, pages 69–76, New York, NY, USA, 2009. ACM.
[Hauser et al., 2007] Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, and Christiane Wanzeck. Information access to historical documents from the Early New High German period. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2007) Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, January 2007.
[Keller, 1986] R. Keller. Die Deutsche Sprache und ihre historische Entwicklung. Helmut Buske Verlag, Hamburg, Germany, 1986. ISBN 3875481046.
[Korbar, 2010] D. Korbar. Visualisation of rule structures and rule modification possibilities for texts with non-standard spelling. Master's thesis, University of Duisburg-Essen, 2010.
[Pilz and Luther, 2009] T. Pilz and W. Luther. Automated support for evidence retrieval in documents with non-standard orthography. In Sam Featherston and Susanne Winkler, editors, The Fruits of Empirical Linguistics, Volume 1: Process, pages 211–228, 2009.
[Pilz, 2009] T. Pilz. Nichtstandardisierte Rechtschreibung - Variationsmodellierung und rechnergestützte Variationsverarbeitung. PhD thesis, University of Duisburg-Essen, 2009.
[Postel, 1969] Hans Joachim Postel. Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten, (19):925–931, 1969.

An Evaluation of Geographic and Temporal Search

Fredric Gey†, Noriko Kando‡, Ray Larson†
†University of California, Berkeley, USA; ‡National Institute of Informatics, Tokyo, Japan
gey@berkeley.edu, ray@ischool.berkeley.edu, kando@nii.ac.jp

Abstract

This paper summarizes parts of the NTCIR-GeoTime Task held in Tokyo, June 15-18, 2010. This task was the first evaluation specifically of search with both geographic and temporal constraints, i.e. it combines geographic information retrieval (GIR) with time-based search to find specific events in a multilingual collection. We describe the data collections (Japanese and English news stories), topic development, research approaches, assessment results and lessons learned from the evaluation.

1 Introduction

Semantic search queries and factoid questions require semantic processing to deliver results beyond bag-of-words search. Geo-temporal search concerns search that has both geographic and temporal constraints. In particular, the search for events, or for answers to questions about events, often involves specificity of location (where) and specificity of time (when). A simple example might be "When and where did Rosa Parks die?", in which the user wishes to know a "specific" date (is it year or month-year?)
and "spe- cific" location (should it be city?) to answer the question. A more complex question "How long after the Sumatra earthquake did its tsunami hit Sri Lanka?" has geographic constraints and wishes to extract a somewhat specific tem- poral expression (e.g. "a few hours") from the document collection being searched. The above examples are taken directly from NTCIR-GeoTime, the first evaluation of geo- temporal search recently presented (mid-June 2010) at the eighth NTCIR Workshop in Tokyo. The results clearly demonstrated that semantic markup for geography and time outperformed traditional IR methodologies. Cultural Geographic search is quite prevalent in many modern search venues. A great number of documents (web, news, and scientific) have a geographic focus. Geo- graphic search allows for a unique user interface, the in- teractive map, which can be utilized not only to narrow the user’s focus by geography, but also to highlight inter- esting events. Geographic information retrieval is con- cerned with the retrieval of thematically and geographic- ally relevant information resources in response to a query of the form {}, e.g. ``Temples within 5 km. of Tokyo''. [Larson 1996, Jones et al 2004]. It has been estimated that 22 per- cent of web searches are location based [Asadi et al 2005]. Systems that support GIR, such as geographic digital lib- raries, and location-aware web search engines, are based on a collection of georeferenced information resources and methods to spatially search these resources with geo- graphic location as a key. Information resources are con- sidered geo-referenced if they are spatially indexed by one or more regions on the surface of the Earth, where the specific locations of these regions are encoded either dir- ectly as spatial coordinates, i.e. geometrically, or indir- ectly by place name [Hill 2006]. However, in order for place names to support a spatial approach to GIR, they must be associated with a model of geographic space. There have been over six workshops [Purves and Clough 2010] on Geographic Information Retrieval (GIR) held in association with SIGIR, CIKM, ECDL or other confer- ences as well as workshops and conference tracks on loca- tion-based search, there has also been 4 years of evalu- ation of GIR within CLEF (the GeoCLEF and GikiCLEF tracks [Mandl et al 2008,Santos et al 2010]). But, until this track at NTCIR, Asian language geographic search had never been specifically evaluated, even though about half of the NTCIR-6 Cross-Language topics had a geo- graphic component (usually a restriction to a particular country). The temporal aspects of search have been largely ig- nored in the IR community, but not in the GIS and com- putational linguistics communities. There has been a spe- cial issue of ACM TALIP on 'Temporal Information Pro- cessing' [Mani, Pustejovsky and Sundheim 2004], as well as at least two workshops on "Temporal and Spatial In- formation Processing". Use of temporal information in web search and results presentation (hit clustering) was explored in [Alonso, Gertz and Baeza-Yates 2007]. The NTCIR-GeoTime task organizers wanted to utilize and in- corporate past research on this aspect as part of the evalu- ation. 2 Data and Test Topics Two news story collections were used for NTCIR-Geo- Time, one Japanese and one English. The Japanese col- lection was Mainichi newspapers for 2002-2005, which had 377,941 documents. The English collection, con- sisted of 315,417 New York Times stories also for 2002- 2005. 
Users of the NYT collection had to pay a fee of $50 US to the Linguistic Data Consortium to prepare and mail the DVD with this collection. Details about these collections and their characteristics may be found in the GeoTime Overview [Gey et al 2010]. The collections matched those used in other tasks in NTCIR-8 (Advanced Cross-Language Question Answering [Mitamura et al 2010] and Multilingual Opinion Tracking [Seki et al 2010]).

Using Wikipedia as the 'ground truth', the organizers created 25 topics in English, phrased as questions, from the annual notable events summary (e.g. http://en.wikipedia.org/wiki/2002). Each of the 25 topics was vetted to hit at least one relevant document in both languages – the non-Japanese-speaking organizers used Google Translate to translate each topic, ran it against the Mainichi collection, and translated and examined the top documents. The process of topic development is also discussed in the Overview [Gey et al 2010]. Four topics were of the form "When and where did <person> die?" with one minor variation, GeoTime0007: "How old was Max Schmeling when he died, and where did he die?" Another question was looking for a fixed list, GeoTime0016: "When and where were the last three Winter Olympics held?" Another, similar question, GeoTime0021: "When and where were the 2010 Winter Olympics host city location announced?", was very difficult because it wanted to know where (Prague, Czech Republic, at the 115th session, July 2, 2003) the IOC (International Olympic Committee) announced that Vancouver would host the 2010 Winter Games. In the opinion of the organizers, the most difficult topic was expected to be GeoTime0025: "How long after the Sumatra earthquake did the tsunami hit Sri Lanka?" This did prove to be one of the difficult topics, but not necessarily the most difficult.

Topics were formatted in XML structures containing a description field and a more extensive narrative field, in both English and Japanese, as in:
Description (EN): When and where did Astrid Lindgren die?
Description (JA): いつ、どこでアストリッド・リンドグレーンは亡くなりましたか?
Narrative (EN): The user wants to know when and in what city the children's author Astrid Lindgren died.
Narrative (JA): ユーザは、いつ、どの都市で、児童書作家のアストリッド・リンドグレーンが死亡したかを知りたいと思っている。
The full set of topics may be found at: http://metadata.berkeley.edu/NTCIR-GeoTime/topics.php

3 Evaluation and Results

An evaluation run consisted of a ranked list of up to 1000 documents for each topic. Relevance judging was done in a traditional manner on a pool of the top 100 documents retrieved from all runs, with duplicates removed.

3.1 Teams Submitting Evaluation Runs

Six teams submitted runs for the English collection and five registered teams ran the 25 topics against the Japanese collection (three other groups agreed to submit runs to broaden the pool; two of these groups are labeled 'anonymous' below).
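The pooling step mentioned above is simple enough to sketch; the run format and the depth parameter are assumptions, apart from the stated pool depth of 100.

```python
def build_judgment_pool(runs, depth=100):
    """Build per-topic judgment pools from ranked runs.

    `runs` is a list of dicts mapping topic_id -> ranked list of doc_ids.
    Returns topic_id -> set of unique doc_ids taken from the top `depth`
    positions of every run (duplicates removed).
    """
    pool = {}
    for run in runs:
        for topic_id, ranking in run.items():
            pool.setdefault(topic_id, set()).update(ranking[:depth])
    return pool
```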
Team Name | Organization submitting English runs
BRKLY | University of California, Berkeley
DCU | Dublin City University, Ireland
IITH† | International Institute of Information Technology, Hyderabad
INESC | National Institute of Electronics and Computer Systems, Lisbon, Portugal
UIOWA | University of Iowa
XLDB | University of Lisbon, Portugal
† Run submitted late, not included in pooling
Table 1: Groups Submitting English Runs

Team Name | Organization submitting Japanese runs
Anon1 | Anonymous submission 1
BRKLY | University of California, Berkeley
FORST | Yokohama National University, Japan
HU-KB | Hokkaido University, Japan
KOLIS | Keio University, Japan
Anon2 | Anonymous submission 2
M | National Institute of Materials Science, Japan
OKSAT | Osaka Kyoiku University, Japan
Table 2: Groups Submitting Japanese Runs

The English groups submitted a total of 25 runs (a maximum of 5 different runs per team was allowed) and the Japanese groups submitted 34 distinct runs.

3.2 Results

Results in [Gey et al 2010] are displayed using three relatively well-established evaluation measures: Average Precision (AP), Q Measure, and Normalized Discounted Cumulative Gain (nDCG). Details about these evaluation measures, which were also used for the IR4QA (Information Retrieval for Question-Answering) task of NTCIR-8, may be found in [Sakai et al 2010]. For simplicity we only display the nDCG results in the following table to show relative performance. A run is specified by team name, topic language, document language and run number, followed by D or DN, where D means that only the description was used and DN means that description and narrative were used from the topic (the IIIT submission did not specify which fields were used).

RUN | nDCG
INESC-EN-EN-05-DN | 0.6246
UIOWA-EN-EN-01-D | 0.6228
BRKLY-JA-EN-01-DN | 0.617
XLDB-EN-EN-02-T-DN | 0.5705
DCU-EN-EN-02-D | 0.5513‡
IIIT-H-EN-EN | 0.2224
‡ statistically significant difference (α=0.01) from the value of the run in the next row
Table 3: Best GeoTime English Run per Team

The most interesting result from this table is that Berkeley had better cross-lingual performance than its monolingual runs. This phenomenon appears occasionally in Cross-Language Information Retrieval when blind feedback obtains additional discriminating terms from the top retrieved documents of an initial retrieval (BRKLY used blind feedback as a baseline [Larson 2010] without geotemporal extensions).

Another way to compare performance is to fix the run type, for example to compare runs in which all teams used only the D (description) part of the topic. The following table compares description-only runs against the Japanese collection.

RUN | nDCG
HU-KB-JA-JA-03-D | 0.5881†
KOLIS-JA-JA-04-D | 0.5159†
Anon2-EN-JA-01-T | 0.4231
M-JA-JA-03-D | 0.3982
FORST-JA-JA-04-D | 0.3772
OKSAT-JA-JA-01-D | 0.3138
BRKLY-JA-JA-02-D | 0.3014
Anon-JA-JA-02-UNK | 0.2085
† statistically significant difference (α=0.05) from the value of the run in the next row
Table 4: Best Japanese D Run per Team (nDCG)

The interesting thing to immediately observe is that BRKLY, which did so well in the English runs, comes in at a relatively low performance using the same blind feedback methodology as for English. Indeed, if we further exclude the anonymous runs (including M), for which we have no methodology, Berkeley's performance is the worst among official Japanese runs. The reason for this has yet to emerge; however, all the other Japanese groups except OKSAT utilized sophisticated geotemporal processing in their approaches to retrieval.
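For reference, nDCG as reported in Tables 3 and 4 can be computed roughly as follows. This is a generic sketch with a log2 discount, not necessarily the exact variant described in [Sakai et al 2010].

```python
import math

def ndcg(run_gains, ideal_gains=None):
    """Normalized discounted cumulative gain for one ranked result list.

    `run_gains` are graded relevance values of the retrieved documents in rank
    order; `ideal_gains` should be the gains of all judged relevant documents
    (defaulting to a reordering of `run_gains` is a simplification).
    """
    def dcg(values):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values))
    ideal = sorted(ideal_gains if ideal_gains is not None else run_gains, reverse=True)
    return dcg(run_gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# A run placing a highly relevant (gain 2) document at rank 1 and a relevant
# (gain 1) document at rank 3, with exactly those two judged relevant:
print(round(ndcg([2, 0, 1, 0], ideal_gains=[2, 1]), 3))  # about 0.95
```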
4 Technical Approaches to Geo-Temporal Retrieval In this section we review the technical approaches taken by the best performing teams. 4.1 English Approaches A wide variety of approaches were utilized by the differ- ent groups. The most conventional was BRKLY’s baseline approach of only doing probabilistic ranking coupled with blind relevance feedback. This worked very well for English, but for Japanese it substantially under- performed the approaches by other teams which submitted Japanese runs. Several groups (DCU from Dublin City University, Ireland, IIT-H of Hyderabad, India, and XLDB of University of Lisbon) primarily utilized geo- graphic enhancements (although XLDB did consult DB- pedia as an external resource using a timestamp) and did not perform as well as groups which tackled the temporal qualities of the retrieval. A more elaborate approach was taken by the INESC group from Lisbon, Portugal who utilized a geographic re- source (Yahoo PlaceMaker) for extracting geographic ex- pressions and the TIMEXTAG1 system from the Univer- sity of Amsterdam for locating temporal expressions from 1http://ilps.science.uva.nl/resources/timextag within both topic and documents. Document processing was done at both the document and sentence level. Their hybrid approach relied upon the maximum amount of se- mantic content from the topic, so they utilized both de- scription and narrative components from each topic. Uni- versity of Iowa utilized a hybrid approach which com- bined probabilistic and (weighted) Boolean query formu- lation. 4.2 Japanese Approaches The most straightforward of these geotemporal ap- proaches was the KOLIS system of Keio University which merely counted the number of geographic and tem- poral expressions found in top-ranked documents of an initial search and then re-ranked based upon initial prob- ability coupled with weighting of the counts. HU-KB of Hokkaido University , similarly to the University of Iowa for English, also combined probabilistic and Boolean query formulation [Mori 2010]. However, in the case of Hokkaido, the Boolean approach was utilized to filter out unwanted documents from the probabilistic ranking. In order to deal with the Boolean tendency to return the null set, HU-KB expanded the vocabulary using a synonym thesaurus. The FORST group of Yokahama University [Yoshioka 2010] used question decomposition to separate out temporal from locational aspects of the topics in order to apply standard factoid question-answering techniques which work well on a single question type (when or where). While KOLIS utilized a custom gazetteer of place names and a fixed list of temporal expressions (not including day-of-week), the Hokkaido [HU-KB] ap- proach used the Cabocha system for named entity tagging [Kudo and Matsumoto 2002]. 5 Topic Difficulty There are two methods of assessing topic difficulty: look- ing at average performance over all runs by topic – the topics with low average precision are assumed to be the most difficult. The other way is to examine differences between median performance and maximum performance – this can demonstrate that particular methods perform better for such topics. 5.1 Topic Difficulty by Average Precision Figures 1 and 2 average the three performance meas- ures over all submitted runs and plot this average by topic. 
Figure 1: Per-topic AP, Q and nDCG averaged over 25 English runs for 25 topics (pool depth 100), sorted by topic difficulty (AP ascending)

The data are sorted by average precision in order to more clearly identify which topics presented the most challenge to successful search. From the point of view of searching the English NYT collection, the four most difficult topics (less than 0.1 overall average precision) seem to be topic 15 (What American football team won the Superbowl in 2002, and where was the game played?), topic 18 (What date was a country invaded by the United States in 2002?), topic 21 (When and where were the 2010 Winter Olympics host city location announced?) and topic 22 (When and where did a massive earthquake occur in December 2003?).

Figure 2: Per-topic AP, Q and nDCG averaged over 34 Japanese runs for 24 topics (pool depth 100), sorted by topic difficulty (AP ascending)

With respect to Japanese search of the Mainichi collection, several other topics (12, 14, and 25) also had average precision below 0.1, while topic 23 searches averaged 0.129. Topic 12 is "When and where did Yasser Arafat die?", topic 14 is "When and where did a volcano erupt in Africa during 2002?", topic 23 is "When did the largest expansion of the European Union take place, and which countries became members?", and topic 25 is the one predicted by the organizers to be difficult: "How long after the Sumatra earthquake did the tsunami hit Sri Lanka?"

5.2 Median/Maximum Topic Performance

Another way to assess performance is to examine individual performance variability across topics. Such performance can be displayed by taking individual topic runs and finding the minimum, median and maximum performance for that topic. These are displayed in Figures 3 (English runs) and 4 (Japanese runs). While for nearly all Japanese topics at least one group had a minimum precision of near zero, there was still a wide variability of performance, both from minimum to median average precision for a topic and from median to maximum precision for a topic. Where the median and maximum are very close, we can infer that almost all groups had good performance.

Figure 3: Per-topic AP showing Minimum, Median and Maximum performance for English runs

An example for English where median and maximum are almost identical is topic 19: "When and where did the funeral of Queen Elizabeth (the Queen Mother) take place?" An example where the best run (UIOWA-EN-03-DN, maximum AP 0.7889) is more than four times better than the median (0.177) is topic 25: "How long after the Sumatra earthquake did the tsunami hit Sri Lanka?" An example (for Japanese) where median and maximum are almost identical is topic 7: "How old was Max Schmeling when he died and where did he die?" On the other hand, topic 19, which showed almost no variation between median and maximum for English, becomes, for Japanese, an example where the maximum precision (1.000, run FORST-JA-JA-02-D) is more than 7 times better than the median precision (0.1339).

Figure 4: Per-topic AP showing Minimum, Median and Maximum performance for Japanese runs
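The per-topic minimum/median/maximum analysis underlying Figures 3 and 4 amounts to a few lines of code; the input format is an assumption.

```python
import statistics

def per_topic_spread(run_scores):
    """Minimum, median and maximum AP per topic across runs.

    `run_scores` maps run_id -> {topic_id: average_precision}.
    Returns topic_id -> (min, median, max), as plotted in Figures 3 and 4.
    """
    by_topic = {}
    for scores in run_scores.values():
        for topic, ap in scores.items():
            by_topic.setdefault(topic, []).append(ap)
    return {t: (min(v), statistics.median(v), max(v)) for t, v in by_topic.items()}
```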
It is worth noting that the minimum for Japanese was a single run in which the team did very poorly on all topics. It should probably be considered an outlier and removed from future analysis. The median performance is a more reliable statistic from which to draw conclusions.

6 Judgment Approaches for Imprecise Temporal Expressions

One of the difficulties in relevance judgment is the extreme variability of temporal expressions in text and how to judge documents with respect to these expressions. As a point of reference, each document had a specific date upon which it was published. At least for English relevance judgments, imprecise expressions relating to that date were seen as sufficient evidence to judge a document relevant. For example, if a document stated "Katharine Hepburn died Wednesday in her home in Connecticut", it was assumed that sophisticated natural language processing could infer the exact date of death from the date of the document. If a document stated (for topic 25) that "a few hours later the Sumatra earthquake tsunami hit the coast of Sri Lanka", the document could be judged relevant.

Finally, we retrospectively realized that a topic needs to be date-stamped if it asks a temporally relative question. For example, topic 16 ("When and where were the last three Winter Olympics held?") was formulated before the 2010 Winter Olympics were held in Vancouver. Thus, while documents could have known that the 2010 Winter Olympics were to be held in Vancouver, the correct answer (for a topic date-stamped before 2010) would be 1998 (Nagano, Japan), 2002 (Salt Lake City, USA), and 2006 (Turin, Italy).

7 Discussion

7.1 Lessons Learned

NTCIR-GeoTime was the first attempt at evaluating geotemporal information retrieval. While Geographic Information Retrieval has had numerous evaluations, the addition of a temporal component has proven very challenging to participants, especially if the topic (question) can be misinterpreted by the automated retrieval process (as in the case of topic 21: "When and where were the 2010 Winter Olympics host city location announced?") or requires a list answer which is time-varying (topic 16: "When and where were the last three Winter Olympics held?"). Teams which relied exclusively on geographic enhancements did not perform as well as those which incorporated some temporal expression processing within their methodologies. Questions remain as to why there was so much performance variability across document collection language (Japanese and English) for the same topics.

7.2 Future Directions

Plans are already being formulated for a second GeoTime evaluation for the NTCIR-9 Workshop in 2011. We are exploring adding further languages, Korean and Chinese, to the document collection set. For participant groups we will make available a standard set of resources (gazetteers, named entity taggers, TimexTag, etc.). In addition, we have a definite desire to evaluate location-based and map-based search simulation, i.e. "What event is happening 'here' and 'now/tomorrow'?", where here and now come from the included latitude/longitude coordinates.
This should facilitate innovative result visualization using Google/MS Earth/map as well as map-based querying (bounding rectangles). 8 Acknowledgments We thank participant assessors Christopher Harris (Uni- versity of Iowa), Krishna Janakiraman (UC Berkeley), Ricardo Vaz and Flávio Esteves (Technical University of Lisbon). Appreciation is also due to Jorge Machado (IN- ESC, Lisbon, Portugal) who programmed the English as- sessment system, and Tetsuya Sakai (Microsoft Research, Asia) who ran the evaluation statistics. The work of the first author was supported by a Visiting Researcher travel grant from the National Institute of Informatics of Japan during June-July 2010. We thank the LWA/WIR review- ers for suggesting improvements and corrections to the original submission. 9 References [Alonso, Gertz and Baeza-Yates 2007] On the Value of Temporal Information in Information Retrieval, SIGIR Forum, Vol. 41 No. 2 December 2007, pp 35-41. [Asadi et al., 2005] S Asadi, , C.-Y. Chang, X. Zhou, and J. Diederich. Searching the world wide web for local services and facilities: A review on the patterns of loca- tion-based queries. In W. Fan, Z. Wu, and J. Yang, ed- itors, WAIM2005, pp. 91–101. Springer LNCS 3739, 2005. [Gey et al 2010] F. Gey, R. Larson, N. Kando, J. Machado and T. Sakai, NTCIR-GeoTime Overview: Evaluating Geographic and Temporal Search, In Proceedings of the 8th NTCIR Workshop Meeting , Tokyo Japan June 15-18, 2010, ISBN: 978-4-86049-053-9. [Harris 2010] C. Harris, Geographic Information Retrieval Involving Temporal Components, in Proceedings of the 8th NTCIR Workshop Meeting. [Hill 2006] L L Hill, GeoReferencing: The Geographic Associations of Information, MIT Press, Cambridge, MA 2006. [Jones et al 2004] C. B. Jones, A. I. Abdelmoty, D. Finch, G. Fu, and S. Vaid. The SPIRIT spatial search engine: architecture, ontologies and spatial indexing. In GiS- cience 2004, Oct. 2004, Adelphi, MD, pages 125–139, 2004. Cunningham [Kishida 2010] K. Kishida, Vocabulary-based Re-ranking for Geographic and Temporal Searching at NTCIR GeoTime Task, in Proceedings of the 8th NTCIR Workshop Meeting . [Kudo and Matusomoto 2002] T. Kudo and Y. Mat- sumoto, Japanese Dependency Analysis using Cas- caded Chunking, in CoNLL 2002, Taipei. [Larson 1996] R. Larson, Geographic information retriev- al and spatial browsing. In GIS and Libraries: Patrons, Maps and Spatial Information, pages 81–124. UIUC - GSLIS, Urbana-Champaign, IL, 1996. [Larson 2010] R Larson, Text Retrieval Baseline for NT- CIR-GeoTime, in Proceedings of the 8th NTCIR Workshop Meeting . [Machado, Borbinha and Martins 2010] J. Machado, J. Borbinha and B. Martins, Experiments with Geo-Tem- poral Expressions Filtering and Query Expansion at Document and Phrase Context Resolution, In Proceedings of the 8th NTCIR Workshop Meeting . 203 [Mani, Pustejovsky and Sundheim 2004] I. Mani, J. Puste- jovsky, and B. Sundheim. Introduction to the special is- sue on temporal information processing. ACM Transac- tions on Asian Language Information Processing (TALIP), 3(1):1–10, 2004 [Mandl et al 2008] T. Mandl, F. Gey, G. Di Nunzio, N. Ferro, M. Sanderson, D. Santos and C. 
Womser-Hack- er, An Evaluation Resource for Geographic Information Retrieval, In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08) Mo- rocco, May, 2008 [Mitamura et al 2010] T Mitamura, H Shima, T Sakai, N Kando, T Mori, K Takeda, C-Y Lin, R Song, C-J Lin, C- W Lee, Overview of the NTCIR-8 ACLIA Tasks: Ad- vanced Cross-Lingual Information Access, in Proceedings of the NTICIR Workshop 8, Tokyo Japan June 15-18, 2010. [Mori 2010] T. Mori, A Method for GeoTime Information Retrieval based on Question Decomposition and Ques- tion Answering, in Proceedings of the 8th NTCIR Workshop Meeting. [Purves and Clough 2010] R. Purves, C. Jones, and P. Clough. GIR’10: 6th workshop on geographic informa- tion retrieval, 2010. http://www.geo.unizh.ch/ rsp/gir10/index.html. [Santos et al 2010] D. Santos, L. Cabara, et al, GikiCLEF: Crosscultural Issues in Multilingual Information Access in Proceedings of LREC 2010, Malta, May 2010. [Sakai et al 2010] T. Sakai, H. Shima, N. Kando, R. Song, C-J. Lin, T. Mitamura, M. Sugimito, C-W. Lee Overview of NTCIR-8 ACLIA IR4QA, in Proceedings of the 8th NTICIR Workshop Meeting, Tokyo Japan June 15-18, 2010. [Seki et al 2010] Y. Seki, L-W Ku, L. Sun, H-H. Chen and N. Kando,, Overview of Multilingual Opinion Ana- lysis Task at NTCIR-8: A Step Toward Cross Lingual Opinion Analysis, in Proceedings of the 8th NTICIR Workshop Meeting, Tokyo Japan June 15-18, 2010. [Yoshioka 2010] M. Yoshioka, A Method for GeoTime Information Retrieval based on Question Decomposi- tion and Question Answering, In Proceedings of the 8th NTICIR Workshop Meeting, Tokyo Japan June 15-18, 2010. 204 Image Retrieval on Mobile Devices Adrian Hub University of Bamberg D-96052, Bamberg, Germany adrian.hub@uni-bamberg.de Abstract With the growing number of mobile devices and the access possibility to thousands of images from these devices, the users call for efficient im- age search techniques for mobile devices. Desk- top paradigms cannot be used with the smaller screen sizes, hence it is needful to offer alterna- tive searching and browsing strategies, which are adapted for mobile devices. In this paper we de- scribe our ideas how image retrieval on mobile devices can be accomplished. 1 Introduction The amount of small mobile devices, which we use in our every day life, grows constantly. There are cell phones, smart phones, tablet PCs, netbooks and so on. Most of them are equipped with more or less powerful cameras and all of them offer enough storage capacity to take and store a lot of photos on the device itself. Furthermore the Internet is easily accessible via broad- band connections offering access to an unlimited number of images. There are image search services like Google im- ages1, photo communities like Flickr2 or Picasa3 and social networks like Facebook4. All of them let the user search for thousands of images and the latter ones let him especially browse through the photos of friends and colleagues. Due to the close integration of social networks, the ac- cess to the Internet and the huge storage capacity on mobile devices, a lot of own and external (from the web) photos are stored on the device or at least accessible. Unfortunately, most mobile devices are usually not designed to manage thousands of photos concerning the small screen size and limited control possibilities. 
The typical thumbnail view is inapplicable, as if the device should give a good overview with thumbnails, they had to be tiny making recognition of images very difficult or if the recognition with bigger thumbnails could be good, there can be placed only few on the small screen of the device regarding their typical screen sizes less than 4 inches. Another drawback could be the lack of a keyboard, which makes tagging and searching for tags quite hard. As our measurements with Flickr crawls showed, a lot of im- ages remain untagged (about 40% in this ‘classical envi- ronment’), so tagging on mobile devices will probably be 1http://www.google.com/imghp 2http://www.flickr.com/ 3http://picasaweb.google.com/ 4http://www.facebook.com/ even less used. One more reason why image retrieval on mobile devices is a challenging topic. Hence, we presented a hybrid system, called Pi- cadomo [Hub et al., 2009], that makes use of Hierarchical Faceted Search (HFS) [Hearst et al., 2002] for the purpose of image retrieval and is adapted for the small screen size of mobile devices. It is based on our visual faceted search for desktop PCs VisualFlamenco [Mu¨ller et al., 2008] and combines a tag-based search, search techniques based on EXIF data as well as Content-Based Image Retrieval (e.g. low-level visual features) to generate a good browsing ex- perience and help the user for finding desired images on mobile devices (see Fig. 1). Figure 1: Screenshot of our Prototype We now want to improve our prototype for mobile de- vices in different ways. First we want to address more im- age sources. Picadomo could handle only images stored on 205 the internal memory card, due to the fact, that the feature extraction takes some time and cannot be done on the fly. Therefore it is essential to use server and/or peer-to-peer networks to broaden the searchable image sources. Second, the facets used to search the images can be im- proved further. Our user experiments in [Hub et al., 2009] showed, which facets were used often and which facets could be neglected. But it often depends on the user and/or the dataset, e.g. EXIF data like the time or place the picture was taken may be only interesting for own content, but not for images from the Internet. Third, the search user interface can be enhanced in many ways. There are a lot of techniques in text retrieval, re- garding the query specification, presentation of results, query reformulation, personalization and visualization, just to name a few. Some of them can be adopted for image retrieval as well, e.g. keyword-in-context (KWIC) views present extracted query terms along with other kinds of in- formation (such as document title) about a specific search result. Similar to this, a selected color facet can be shown as overlay on image search results. The paper is organized as follows: In the next section we briefly describe related work, in Section 3 we discuss dif- ferent image sources along with their problems and meth- ods of resolutions. Section 4 describes the two main search strategies the user typically is faced with. In Section 5 we discuss some user interface enhancements before we sum- marize our work in Section 6. 2 Related Work Most of the existing applications on mobile devices browse images just by folder, some let the user assign tags to im- ages or classify them in albums and few let the user browse multiple photo albums from Facebook, Picasa, Flickr and your memory card. 
The JustPictures [Quillard, 2010] application allows the user to browse the above mentioned photo albums on the web, it shows EXIF data, it automatically notifies the user of album updates, can handle authentication of users to ac- cess private albums and many more (see Figure 2 for the album view of this application). But the user has still to know where to look for an image, as there is no search interface. JustPictures shows EXIF data for photos, but there is no possibility to browse the photos by one selected EXIF feature. Another way to organize personal photo collections on mobile devices is using time information as main order- ing criterion for visualization and interaction [Harada et al., 2004]. But time information should not be the only aspect, since browsing and searching by facets can offer much more possibilities. A multi-faceted image search and browsing system, named Scenique [Bartolini, 2009], allows the user to man- age photo collections by using both visual features and tags, possibly organized into multiple dimensions (see Fig- ure 3). But this solution is not designed for mobile devices, as it was made for desktop-PCs and therefore requires a screen with higher resolution. Within this section, we only describe some of the exist- ing applications that deal with the administration of image collections on mobile devices. Mor Naaman et al. give a more detailed overview in [Naaman et al., 2008]. Figure 2: JustPictures in album view 3 Feature Database Our approach relies on extracted features, that are stored in some kind of database and are accessible through the mo- bile device. Obviously, every image needs to be processed with our feature extractor to make it findable on the mobile device. This is no problem for images on the internal mem- ory, as their features can just be stored locally on the mobile device, after they have been extracted. But this could get a little bit more challenging for external image sources, like the web or social network sites. Therefore we need to dif- ferentiate between local images, images from friends (e.g. Facebook or Flickr) and external sources (e.g. Google im- ages or any web page). 3.1 Server Database If we want to use our faceted search for searching im- ages on the Internet, the best solution would probably be a server, that stores the facet data. The server can crawl the Internet and extract the features for given images. The database could then contain all facet data along with a re- sized version of the image and an URL, where the original image can be found. The mobile device obtains this infor- mation from the server to offer the results to the user. Another possibility could be the cooperation with photo album websites. Whenever users upload their images, the features could be extracted and offered via an interface to the mobile devices, to make the images easily searchable. Obviously, both solution require server power, for ex- tracting and storing the facets. To avoid these costs a peer- to-peer (P2P) solution may come in handy. 3.2 Peer-to-Peer Database Another approach to store the facet data is to share it in a peer-to-peer environment. All users of the application can 206 Figure 3: Scenique browsing interface define their local photos as private, public or only view- able for friends. Depending on these settings and the rela- tionship between the users they are able to browse public albums or photos of their friends using a P2P connection. 
The process of feature extraction and the storage of facet data happens all on the mobile device. If a user starts the search for an image, the desired feature set will be ex- changed using the P2P-network. Matching image results will then be forwarded to the searching user. A JXTA-based implementation of a similar scenario was presented in [Mu¨ller et al., 2007]. 3.3 Local Database If a user wants to use our faceted browsing application for browsing through her friend’s photos, but her friends don’t use the application or even don’t have a mobile device, there is another possible solution. The user could simply mark some albums of her friends in our application as fa- vorite albums, to indicate, that she wants to make these albums searchable via faceted browsing. Our application then processes these albums in the background, extracting the image features and stores them on the local memory. That way the user can search her own and the favorite al- bums of her friends with our faceted browsing application. 4 Search Strategies When searching for images, there are generally two differ- ent use cases. First, recover images that the user has al- ready seen before, e.g. photos taken with the camera of the mobile device, his own web albums or an image of a web album of any friend. Second, discover new images that the user hasen’t seen before, e.g. images from the Internet or recently added photos of any web album of a friend. The search strategies for the two use cases may vary in detail. 4.1 Recover Obviously the easier task is to recover an already seen im- age. The user may have a rough imagination of the image content, so we can use content-based image retrieval. As an example, the dominant color of a (region of an) image can help finding the image again. Other possibilities are the contrast of the image, texture or orientational features. Any known metadata of an image can help finding it. EXIF features like the date or place the photo was taken, the camera model or if the photo was taken with or without flash allow simple browsing and searching. These facets can be used best for the user’s own photos or for profes- sional users. If the user has his photos tagged, a tag based search can offer another possibility to search within photos. Due to the fact, that a lot of images remain untagged, this may be useful for only a small number of images. Another option for searching images can be based on GPS data. If GPS data are available for a given set of photos, these photos can be placed on a map according to the place the photos were taken. In doing so, the user can easily search for photos of famous places via zooming and scrolling in a world map. 4.2 Discover As a matter of principle, all approaches for recover images can also be used to discover unseen images. The expec- tation of the user regarding the resulting images is not as detailed given in this case. Hence, it depends on the indi- vidual use case, if it is an easier task to search for unseen images or not. In addition the search for unseen images should support a key word query based on user input, as used for example on Google images. The result set of a key word query can then be further searched using our facets. 5 Search Interface Regarding the search interface we want to describe in short how we could enhance the user interface and the presenta- tion of results. 5.1 User interface The search interface may be the most important part for searching and browsing through images. 
For a detailed illustration of our interface, see the prototype in [Hub et al., 2009]. Based on user feedback, several improvements can be made. Many users asked for a timeline view that arranges the images chronologically. Combined with individual time ranges this could indeed be a good way to search images: if the user wants to see the images of her last summer holiday, she can adjust the time range from August to September and only photos taken during that time will be shown. Furthermore, our user experiments showed that the number of color facets can be reduced. The fine-grained color shades of our prototype were not used very often. We believe that ten to twelve main color shades are enough (see the color picker at Google Images). Other unused facets, such as the contrast of global and local regions, can be reduced to a single global contrast feature. Most modern mobile devices are equipped with motion sensors, so it could be useful to exploit them for navigation. We could imagine scrolling through a result list by pitching the mobile device. If the user shakes the device, the search could be reset or the last step could be undone. Alternatively, multitouch gestures can be used for scrolling, zooming and navigating through the search history.

5.2 Presentation of Search Results
Another very important aspect when searching for images is the presentation of the search results. A simple list view is used most of the time, as is usual with web search engines. For text retrieval, these lists can be enhanced with various improvements: summary information for every hit, query term highlighting, sparklines, previews of the document content and many more (see [Hearst, 2009] for more information). Some of these well-established techniques for text retrieval can be adjusted and adapted for image retrieval. For example, the global dominant color of an image can be represented by a frame around the image dyed in the corresponding color. This can also be used as an overlay for some regions of an image, to indicate the selected color feature similar to query term highlighting. The context of an image can also easily be displayed: if the result image comes from a webpage or a web album, the previous and next image can be presented as thumbnails along with the result image. This gives the user a quick and compact overview of the result, and she can recognize whether there are more similar images in one continuous photo stream or whether it is a collection of totally different images, which could come in handy for image search on web pages. A totally different way to present the search result could be to arrange the result set as a three-dimensional image globe. The images can be arranged on a virtual globe according to their dominant color using the HSV color model [Zhang et al., 1999]. The north pole is for light images with high values and the south pole for dark images with low values. The hues are grouped around the equator and all images of the result set are positioned where they fit best. The saturation can be neglected in this model. Navigation could be done easily by pitching the mobile device to let the image globe roll. For big result sets, the user can zoom in to show more or bigger images.

6 Conclusion
In this paper we described some ideas how image retrieval on mobile devices can be improved. Our first prototype should be enhanced with the presented features.
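As a side note to the image globe of Section 5.2, the mapping from a dominant HSV color to a position on the virtual globe could look as follows. This is a minimal interpretation of the description above (value drives latitude, hue drives longitude, saturation is ignored); the function name and the exact value ranges are assumptions, not taken from the paper.

    def globe_position(hue, value):
        """Map a dominant HSV color to (latitude, longitude) on the image globe.

        hue:   0..360 degrees, spread around the equator as longitude.
        value: 0..1, light images (high value) near the north pole,
               dark images (low value) near the south pole.
        Saturation is neglected, as described in the text.
        """
        latitude = -90.0 + 180.0 * value    # value 0 -> south pole, value 1 -> north pole
        longitude = (hue % 360.0) - 180.0   # hue mapped to -180..180 degrees
        return latitude, longitude

    # Example: a bright red image (hue 0, value 0.9) ends up near the north pole.
    # print(globe_position(0, 0.9))  # (72.0, -180.0)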
We be- lieve that this is a useful approach for finding images from webpages, online albums and internal storage. Besides the realization of mentioned features in our pro- totype the search for partners could be very helpful, to in- clude the faceted image search into existing photo sharing communities. References [Bartolini, 2009] Ilaria Bartolini. A mulit-faceted brows- ing interface for digital photo collections. CBMI, 7:237– 242, 2009. [Harada et al., 2004] Susumu Harada, Mor Naaman, Yee Jiun Song, QianYing Wang, and Andreas Paepcke. Lost in memories: interacting with photo collec- tions on pdas. In JCDL ’04: Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, pages 325–333, New York, NY, USA, 2004. ACM. [Hearst et al., 2002] Marti Hearst, Ame Elliott, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ka- Ping Yee. Finding the Flow in Web Site Search. Com- mun. ACM, 45(9):42–49, 2002. [Hearst, 2009] Marti A. Hearst. Search User Interfaces. Cambridge Univeristy Press, 2009. [Hub et al., 2009] Adrian Hub, Daniel Blank, Wolfgang Mu¨ller, and Andreas Henrich. Picadomo: Faceted image retrieval for mobile devices. CBMI, 7:249–254, 2009. [Mu¨ller et al., 2007] Wolfgang Mu¨ller, Soufyane El Allali, Daniel Blank, Andreas Henrich, and Thomas Lauter- bach. Hunt the Cluster: A Scalable, Interactive Time Bayesian Image Browser for P2P Networks. Multi- media Workshops, International Symposium on, 0:317– 322, 2007. [Mu¨ller et al., 2008] Wolfgang Mu¨ller, Markus Zech, An- dreas Henrich, and Daniel Blank. VisualFlamenco: De- pendable, Interactive Image Browsing Based on Visual Properties. CBMI International Workshop on Content- Based Multimedia Indexing, 2008. [Naaman et al., 2008] Mor Naaman, Rahul Nair, and Vlad Kaplun. Photos on the go: a mobile application case study. In CHI ’08: Proceeding of the twenty-sixth an- nual SIGCHI conference on Human factors in comput- ing systems, pages 1739–1748, New York, NY, USA, 2008. ACM. [Quillard, 2010] Gregory Quillard. Justpictures. http://code.google.com/p/justpictures/, 2010. [Zhang et al., 1999] L. Zhang, F. Lin, and B. Zhang. A CBIR Method Based on Color-Spatial Feature. IEEE Region 10 Annual International Conference, pages 166– 169, 1999. 208 Named Entity Disambiguation for German News Articles Andreas Lommatzsch, Danuta Ploch, Ernesto William De Luca, Sahin Albayrak DAI-Labor, TU Berlin, 10587 Berlin, Ernst-Reuter-Platz 7 {andreas.lommatzsch, danuta.ploch, ernesto.deluca, sahin.albayrak}@dai-labor.de Abstract Named entity disambiguation has become an im- portant research area providing the basis for im- proving search engine precision and for enabling semantic search. Current approaches for the named entity disambiguation are usually based on exploiting structured semantic and lingual re- sources (e.g. WordNet, DBpedia). Unfortu- nately, each of these resources cover indepen- dently from each other insufficient information for the task of named entity disambiguation. On the one hand WordNet comprises a relative small number of named entities while on the other hand DBpedia provides only little context for named entities. Our approach is based on the use of multi-lingual Wikipedia data. We show how the combination of multi-lingual resources can be used for named entity disambiguation. Based on a German and an English document corpus, we evaluate various similarity measures and algo- rithms for extracting data for named entity dis- ambiguation. 
We show that the intelligent filter- ing of context data and the combination of multi- lingual information provides high quality named entity disambiguation results. 1 Introduction Named entity recognition (NER) and named entity disam- biguation (NED) are usually seen as a subtask of informa- tion extraction aiming to identify and classify text elements into predefined categories, such as name of persons, orga- nizations, or locations. Ambiguous names are resolved by determining the correct referent. Data provided by NER and NED is required for various applications reaching from semantic search and clustering to automatic summarization and translation. The process of named entity recognition and disambiguation usually consists of three steps: 1. Identify words (or groups of words) representing an entity in a given text 2. Collect data for describing the identified entity in de- tail (entity properties, the relationship to other entities) 3. Classify the identified entity and calculate which candidate entity (from an external knowledge base) matches best (for resolving ambiguity). The most complex step in the disambiguation process is the collection of metadata for the identified entity de- scribing the entity context. Many systems use linguistic grammar-based techniques as well as statistical models for NED. Manually created grammar-based systems often ob- tain a high precision, but show a lower recall and require months of work by linguists. Other systems are based on deploying dictionaries and lexical resources, such as Word- Net [Fellbaum1998] or DBpedia [Lehmann et al.2009]. In this paper we evaluate how comprehensive, multi- lingual resources (such as Wikipedia) can be deployed for NED. Based on the context of identified entities and infor- mation retrieved from Wikipedia we disambiguate entities and show which parameter configurations provide the best accuracy. The paper is organized as follows. The next section de- scribes the current state of the art and presents the most popular methods for NED. Subsequently, we introduce our approach. Based on the context of identified entities and data retrieved from Wikipedia we use text similarity mea- sures to perform NED. The approach is analyzed on a col- lection of German news articles and on the Kulkarni name corpus1. The evaluation shows that our approach provides high quality results. Finally we draw a conclusion and give an outlook to future work. 2 Fundamentals The concept of named entities (NE) was first introduced by the Message Understanding Conferences (MUC) eval- uations [Sundheim1996]. The entities were limited to nu- merical expressions and a few classes of names. However, along with the development of information extraction and search technologies, the categories used for NEs were ex- tended. Nowadays, complex ontologies (such as the Sug- gested Upper Merged Ontology, containing about 1,100 most general concepts [Pease and Niles2002]) are consid- ered as the basis for named entities. Current strategies for named entity recognition and dis- ambiguation are mostly based on the use of ontological data and on context analysis. Since corpus based approaches are likely to suffer from the data sparseness problem, large lex- ical data collections, such as Wikipedia are a good choice to mine concepts and to enrich sparse corpora. Wikipedia is the largest online encyclopedia which provides linked and partially annotated data and descriptive information. Based on the evaluation of articles types (e.g. 
disambiguation pages, category articles) and the analysis of templates (e.g. info boxes) even semantic knowledge can be extracted. A popular project aiming to provide structured information from Wikipedia is DBpedia [Lehmann et al.2009]. Some authors [Cui et al.2009] built domain specific taxonomies from Wikipedia by analyzing the URLs and internal links 1http://www.d.umn.edu/˜pederse/namedata.html 209 in Wikipedia pages, such as category labels or info boxes. For extracting new facts from Wikipedia Wu & Weld [Wu and Weld2007] suggest a cascade of conditional random field models. Beside the approaches based on structured, ontologi- cal knowledge, there are projects that use unstructured Wikipedia data for NED. Cucerzan [Cucerzan2007] re- trieves for words (“surface forms”) identified to represent relevant entities all potentially matching Wikipedia arti- cles. The Wikipedia contexts that occur in the document and the category tags are aggregated into a string vector, which is subsequently compared with the Wikipedia entity vector (of categories and contexts) of each possible entity. Then the assignment of entities to surface forms is chosen that maximizes the similarity between the document vector and the entity vectors. Alternative approaches for named entity disambiguation are based on the co-occurring anal- ysis of named entities [Nguyen and Cao2008], the analy- sis of word based features (e.g. part-of speech, pattern) [Mann and Yarowsky2003] or document meta-information [Nadeau and Sekine2007,Bunescu and Pasca2006]. 3 Approach In our approach we focus on multi-lingual data retrieved from Wikipedia. Our intention is to have a robust ap- proach providing a constantly high disambiguation preci- sion. Thus, we do not rely on semantic annotations in Wikipedia (e.g. info boxes) or DBpedia content due to the fact that this data is not available for all entities. The developed NED component is part of a project2 for the semantic clustering of news, implemented using the UIMA framework3. The identification of words repre- senting potentially relevant entities is done using DBpedia data. The component searches for surnames in news arti- cles (present in the DBpedia person dataset). The found surnames and the assigned DBpedia entities are used as in- put data for the disambiguation task. For performing NED, we analyze various methods to retrieve context data for the potentially matching entities. Due to the fact that we perform the named entity recog- nition based on DBpedia we focus on Wikipedia as data source. We analyze the following four content extraction strategies: • We extract the first section of theWikipedia article and remove the stop words. Usually the first section con- tains the most important information of the respective article. • We extract the first section of theWikipedia article and remove the stop words. The whole text is converted to lower case characters when calculating the string similarity. • We extract all words with capital letters from the first section of the Wikipedia article. This is done to re- strict the content to proper nouns (stop words are re- moved). • We extract all words of the (complete) Wikipedia ar- ticle that link to Wikipedia articles. The idea behind this method is, that the linked words contain the data suitable for the NED. 2http://www.dai-lab.de/competence centers/irml/projekte/spiga/ 3http://uima.apache.org/ We extract these data from the German as well as from the English Wikipedia. 
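To make the four extraction strategies above more tangible, here is a minimal sketch in Python. The helper names are hypothetical, the stop word list is a toy, and it assumes the article is available as raw wikitext with internal links in the usual [[target|label]] markup; none of this is prescribed by the paper.

    import re

    STOP_WORDS = {"der", "die", "das", "the", "a", "an", "of", "und", "and"}  # toy list

    def first_section_terms(wikitext, lowercase=False):
        """Strategy 1/2: terms of the first section, stop words removed,
        optionally lowercased before the similarity calculation."""
        first = re.split(r"\n==", wikitext, maxsplit=1)[0]  # text before the first heading
        words = re.findall(r"\w+", first)
        if lowercase:
            words = [w.lower() for w in words]
        return [w for w in words if w.lower() not in STOP_WORDS]

    def capitalized_terms(wikitext):
        """Strategy 3: only capitalized words of the first section (proper-noun proxy)."""
        return [w for w in first_section_terms(wikitext) if w[:1].isupper()]

    def linked_terms(wikitext):
        """Strategy 4: link targets of the complete article ([[target]] or [[target|label]])."""
        targets = re.findall(r"\[\[([^\]|#]+)", wikitext)
        return [t.strip() for t in targets if t.strip()]

The German and the English article of a candidate entity would each be run through such functions before the similarity step described next.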
We use language-specific stop word lists and rule-based stemmers. In the next step we calculate the similarity between the retrieved Wikipedia content and the analyzed news article. The similarity calculation is done with the following algorithms:

1. Jaccard similarity: The Jaccard similarity calculates the similarity of two word vectors (X, Y) as follows:

\mathrm{Jaccard}(X,Y) = \frac{X \cdot Y}{|X|^2 + |Y|^2 - X \cdot Y}

where X \cdot Y is the inner product of X and Y, and |X| = \sqrt{X \cdot X} is the Euclidean norm of X.

2. Dice similarity: The Dice coefficient is a term-based similarity measure defined as twice the number of terms common to the compared entities divided by the total number of terms in both tested entities. A coefficient of 1 indicates identical vectors, whereas 0 indicates orthogonal vectors.

\mathrm{Dice}(X,Y) = \frac{2 \cdot \#\text{common terms}}{\#\text{terms}(X) + \#\text{terms}(Y)}

3. Overlap coefficient: The overlap coefficient calculates the similarity based on the common terms:

\mathrm{Overlap}(X,Y) = \frac{\#\text{common terms}(X,Y)}{\min\left(\#\text{terms}(X), \#\text{terms}(Y)\right)}

4. Weighted term similarity: A similarity based on the number of matched terms, weighted by the term length and the percentage of matched terms:

\mathrm{wTSim}(X,Y) = \sqrt{\frac{|T|}{|X|}} \cdot \sum_{t \in T} \#(t \in Y)\left(1 + \log\frac{1}{1 + \#(t \in Y)}\right)^{2} \mathrm{len}(t)

where T is the set of common terms in X and Y and len(t) is the function that calculates the length of term t. The weighted term similarity is often used by search engines for calculating a relevance score.

We calculate for each entity the similarity between the retrieved content and the news article. For the disambiguation task we determine the entity with the highest similarity. Overall, we test 32 different variants (4 content extraction strategies × 4 similarity measures × 2 languages). Additionally, we create ensembles combining the results of different languages (based on the CombSUM algorithm [Hull et al., 1996]).

4 Experiments and evaluation
We evaluate the performance of our approach on a corpus of German news documents as well as on a collection of web search results in English.

4.1 Evaluation on the Kulkarni name corpus
The Kulkarni name corpus was created to evaluate NED algorithms. The corpus contains a set of queries where each query consists of a query string and a set of approximately 200 documents (retrieved from a search engine for the query string). The query strings represent ambiguous person names. The documents in the result set are manually annotated with the DBpedia URL of the entity related to the document. We performed the NED as described in section 3 and calculate the accuracy4 over all queries. We consider only the English Wikipedia as content source since the relevant entities in the Kulkarni name corpus are not present in the German Wikipedia. The accuracy for the analyzed content extraction strategies and the considered similarity measures is shown in Figure 1.

Figure 1. The named entity disambiguation accuracy on the Kulkarni name corpus (y-axis: accuracy from 0.0 to 1.0; strategies WikipediaLinksEN, WikipediaAbstractEN, WikipediaLowerCaseAbstractEN and WikipediaCapitalWordsAbstractEN, grouped by Dice, Jaccard, overlap and weighted term similarity).

The results show that the disambiguation accuracy highly depends on the retrieved Wikipedia content. The best results are obtained if the disambiguation is based on the linked words in the Wikipedia article. The applied similarity measure has only a small influence on the result.
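For reference, the similarity step can be sketched as follows. This is a minimal sketch only: the Jaccard function follows the vector form given above on term-frequency counts, Dice and the overlap coefficient use their usual set formulations, the weighted term similarity is omitted, and the disambiguate helper is a hypothetical name rather than part of the described component.

    from collections import Counter

    def jaccard(x_terms, y_terms):
        """Generalized Jaccard on term-frequency vectors: (X.Y) / (|X|^2 + |Y|^2 - X.Y)."""
        x, y = Counter(x_terms), Counter(y_terms)
        dot = sum(x[t] * y[t] for t in x)
        nx2 = sum(v * v for v in x.values())
        ny2 = sum(v * v for v in y.values())
        denom = nx2 + ny2 - dot
        return dot / denom if denom else 0.0

    def dice(x_terms, y_terms):
        x, y = set(x_terms), set(y_terms)
        return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

    def overlap(x_terms, y_terms):
        x, y = set(x_terms), set(y_terms)
        return len(x & y) / min(len(x), len(y)) if (x and y) else 0.0

    def disambiguate(article_terms, candidates, sim=overlap):
        """Pick the candidate entity whose extracted Wikipedia content is most
        similar to the news article; candidates maps entity name -> term list."""
        return max(candidates, key=lambda e: sim(article_terms, candidates[e]))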
Overall, the overlap coefficient and the weighted term sim- ilarity provide the best result. Thus, for a further improve- ment of the disambiguation accuracy the content extraction strategies should be optimized e.g. by considering the dif- ferent types of Wikipedia links. 4.2 Evaluation on German news articles Due to the fact, that the focus of our project is to clus- ter German news documents semantically, we create a new corpus optimized on our scenario. We randomly se- lected 65 news articles (from January 2009 crawled from various internet news sources, e.g. Netzzeitung and DiePresse.com) covering the topics politics and sports. Based on the German DBpedia person corpus5 we identi- fied potentially relevant people (having a surname present in the news document). We manually annotated which of the identified entities are related to the news article. Thus, each corpus element consists of the following data: • The news document as plain text (390 words) • The search string (surname of the person), e.g. Schumacher • A list of DBpedia entities having the search string as surname (limited to 10 entities), e.g. [Michael Schumacher, Brad Schumacher, Anton Schumacher, Heinrich Christian ...] For the disambiguation, we considered the German and the English Wikipedia. The disambiguation accuracy for 4accuracy = number of correctly assigned entities total number of entities 5http://downloads.dbpedia.org/3.5.1/de/persondata de.nq.bz2 the considered content extraction strategies and the simi- larity measures are shown in Figure 2. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 DiceSimilarity JaccardSimilarity OverlapCoefficient WeightedTermSimilarity WikipediaLinksDE WikipediaLinksEN WikipediaAbstractEN WikipediaLowerCaseAbstractEN WikipediaCapitalWordsAbstractEN WikipediaAbstractDE WikipediaLowerCaseAbstractDE WikipediaCapitalWordsAbstractDE Ac cu ra cy Figure 2. The named entity disambiguation accuracy on the German news corpus. The evaluation shows that the highest accuracy is achieved when using the weighed term similarity and the linked words in the German Wikipedia page as reference content. In the analyzed scenario the similarity measure has a higher impact on the accuracy than the content extrac- tion strategy. As expected, the accuracy achieved using the German Wikipedia content is always above the accuracy achieved based on the English content (since we analyzed German news). Nevertheless, the accuracy based on the analysis of the linked words in the English Wikipedia pro- vided relatively good results. Comparing the dependencies between the accuracy from the content extraction strate- gies, the evaluation shows that good results are obtained if only relevant terms from the content source are filtered. 4.3 Multi-lingual ensembles We analyze how the disambiguation accuracy can be im- proved by combining German and English content extrac- tion strategies. Based on the strategy CombSUM we build ensembles combining the strategies discussed in section 4.2. The ensembles are created incrementally adding in each step the strategy to the ensemble that enables the best accuracy improvement. The weights for each strategy in the ensemble are calculated using an optimization algo- rithm based on genetic algorithms. The evaluation (Figure 3) shows that the weighted combination of strategies ex- tracting data from the German and the English Wikipedia can improve the disambiguation accuracy by 13%. 
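The weighted CombSUM combination used for these ensembles can be sketched as follows. This is a simplified illustration with hypothetical strategy scores and hand-picked weights; the paper learns the weights with a genetic algorithm, and score normalization details are not given there.

    def comb_sum(score_lists, weights):
        """Weighted CombSUM: sum the weighted scores each strategy assigns to
        every candidate entity and rank the entities by the combined score.

        score_lists: list of dicts, one per strategy, mapping entity -> score.
        weights:     one weight per strategy, e.g. learned offline.
        """
        combined = {}
        for scores, w in zip(score_lists, weights):
            for entity, s in scores.items():
                combined[entity] = combined.get(entity, 0.0) + w * s
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

    # Example with two hypothetical strategies (German links, English links)
    # and illustrative weights:
    # ranking = comb_sum(
    #     [{"Michael Schumacher": 0.8, "Brad Schumacher": 0.3},
    #      {"Michael Schumacher": 0.6, "Brad Schumacher": 0.5}],
    #     weights=[0.7, 0.3],
    # )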
If En- glish Wikipedia articles are used for the NED in German news articles the similarity is implicitly restricted to proper nouns (such as person or location names) that are not trans- lated. Even though the strategies based only on the English Wikipedia content do not show a high disambiguation ac- curacy, in a combination with other strategies they improve the accuracy. 5 Conclusion and future work We analyzed how multi-lingual content retrieved from Wikipedia can be used for NED. For evaluating the algo- rithms on German and English documents, we created a corpus of German news documents. Based on an English and a German corpus we analyzed various string similarity 211 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 (WikipediaLinksDE, WeightedTermSimilarity) [1.00] (WikipediaLinksDE, WeightedTermSimilarity) [0.69] (WikipediaLinksEN, WeightedTermSimilarity) [0.31] (WikipediaLinksDE, WeightedTermSimilarity) [0.68] (WikipediaLinksEN, WeightedTermSimilarity) [0.31] (WikipediaAbstractEN, JaccardSimilarity) [0.01] Ac cu ra cy Figure 3. The disambiguation accuracy of the analyzed ensembles on the German news corpus. measures and several strategies for content extraction. The evaluation results show, that in the chosen scenarios a high disambiguation accuracy can be obtained. The accuracy highly depends on the content extraction strategy. In gen- eral, the more complex extraction strategies provide better results. Moreover, we showed that the combination of multi- lingual data improves the accuracy. The combination of German and English content extraction strategies improves the accuracy up to 13%. The combination of German and English content gives proper nouns a higher weight what results in a better disambiguation accuracy. A deeper anal- ysis how the used parameter settings and the article lan- guage influence the NED accuracy will be done in the near future. As future work we will take a deeper look on the rela- tionships of the relevant named entities. The goal is to inte- grate semantic data sources (such as Freebase6, or YAGO7) and to combine the data with multi-lingual Wikipedia data. We want to analyze the semantic relationships between the entities and learn weights for all relevant types of connec- tions between entities. The learned weighs are used to adapt the similarity measures to the context and the respec- tive domains. Moreover, in the project we will focus on en- semble learning algorithms [Polikar2006] (such as prefer- ence learning [Tsai et al.2007] and boosting strategies [Fre- und and Schapire1996]) to combine multi-lingual data and various features. References [Bunescu and Pasca, 2006] Razvan C. Bunescu and Mar- ius Pasca. Using encyclopedic knowledge for named en- tity disambiguation. In Proc. of the 11th Conference of the EACL). The Assn. for Computer Linguistics, 2006. [Cucerzan, 2007] Silviu Cucerzan. Large-scale named en- tity disambiguation based on Wikipedia data. In Proc. 6http://www.freebase.com/ 7http://www.mpi-inf.mpg.de/yago-naga/yago/ of the 2007 Joint Conf. on Empirical Methods in Nat- ural Language Processing and Computational Natural Language Learning, pages 708–716, Prague, Czech Re- public, June 2007. Assn. for Computational Linguistics. [Cui et al., 2009] Gaoying Cui, Qin Lu, Wenjie Li, and Yirong Chen. Mining concepts from wikipedia for on- tology construction. In Proc. of the 2009 Intl. Joint Conf. on Web Intelligence and Intelligent Agent Tech- nology WI-IAT ’09, pages 287–290, Washington, DC, USA, 2009. IEEE Computer Society. 
[Fellbaum, 1998] C. Fellbaum. WordNet: An Electroni- cal Lexical Database. The MIT Press, Cambridge, MA, 1998. [Freund and Schapire, 1996] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Intl. Conference on Machine Learning, pages 148– 156, 1996. [Hull et al., 1996] David A. Hull, Jan O. Pedersen, and Hinrich Schu¨tze. Method combination for document fil- tering. In SIGIR ’96: Proc. of the 19th ACM SIGIR conf. on Research and development in information retrieval, pages 279–287, New York, USA, 1996. ACM Press. [Lehmann et al., 2009] Jens Lehmann, Chris Bizer, Georgi Kobilarov, Sren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009. [Mann and Yarowsky, 2003] Gideon S. Mann and David Yarowsky. Unsupervised personal name disambigua- tion. In Proc. of the 7th conference on Natural language learning at HLT-NAACL 2003, volume Volume 4, pages 33 – 40, Edmonton, Canada, 2003. Assn. for Computa- tional Linguistics Morristown, NJ, USA. [Nadeau and Sekine, 2007] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classi- fication. Linguisticae Investigationes, 30(1):3–26, Jan- uary 2007. Publisher: John Benjamins Publishing Co. [Nguyen and Cao, 2008] H.T. Nguyen and T.H. Cao. Named entity disambiguation on an ontology enriched by wikipedia. In Research, Innovation and Vision for the Future, pages 247 –254, 2008. [Pease and Niles, 2002] Adam Pease and Ian Niles. Ieee standard upper ontology: a progress report. Knowl. Eng. Rev., 17(1):pages 65–70, 2002. [Polikar, 2006] Robi Polikar. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6(3):21 – 45, 9 2006. 1531-636X. [Sundheim, 1996] Beth M. Sundheim. The message un- derstanding conferences. In Proc. of the TIPSTER Text Program: Phase II, pages 35–37, Vienna, VA, USA, May 1996. Assn. for Computational Linguistics. [Tsai et al., 2007] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. FRank: a ranking method with fidelity loss. In SIGIR ’07: Proc. of the 30th ACM SIGIR conf. on Research and development in information retrieval, pages 383–390, New York, USA, 2007. ACM. [Wu and Weld, 2007] Fei Wu and Daniel S. Weld. Au- tonomously semantifying wikipedia. In CIKM ’07: Proc. of the 16th ACM conf. on information and knowl- edge management, pages 41–50, New York, USA, 2007. ACM. 212 Abstract This paper is about an information retrieval evaluation on three different retrieval-supporting services. All three services were designed to compensate typical problems that arise in meta- data-driven Digital Libraries, which are not ade- quately handled by a simple tf-idf based retrieval. The services are: (1) a co-word analysis based query expansion mechanism and re-ranking via (2) Bradfordizing and (3) author centrality. The services are evaluated with relevance assess- ments conducted by 73 information science stu- dents. Since the students are neither information professionals nor domain experts the question of inter-rater agreement is taken into consideration. Two important implications emerge: (1) the in- ter-rater agreement rates were mainly fair to moderate and (2) after a data-cleaning step which erased the assessments with poor agreement rates the evaluation data shows that the three retrieval services returned disjoint but still relevant result sets. 1. 
Introduction Metadata-driven Digital Libraries (DL) face three typical difficulties: (1) the vagueness between search and index- ing terms, (2) the information overload by the amount of result records returned by information retrieval systems, and (3) the problem that pure term frequency based rank- ings, such as term frequency – inverse document fre- quency (tf-idf), provide results that often do not meet user needs [Mayr et al., 2008]. To overcome these limitations the DFG-funded project “Value-Added Services for In- formation Retrieval” (IRM1) is doing research on an over- all approach to use computational science models as en- hanced search stratagems [Bates, 1990] within a scholarly retrieval environment, which might be implemented within scholarly information portals. To show that a user’s search improves by using these model-driven search services when interacting with a scientific information system, an information retrieval evaluation was conducted. Since the assessors in this experiment were neither information professionals nor domain experts we had especially looked at the level of agreement between the different assessors. This also is our special interest since in big evaluation campaigns like CLEF only a minority of the evaluations made their 1 http://www.gesis.org/irm evaluations with regard to the inter-rater agreement. In the CLEF campaign of 2009 no analysis of inter-rater agree- ment was made in the classical ad-hoc track [Ferro and Carol, 2009] and only one to four assessors had to judge the pooled documents. The only track that made these inter-rater analyses was the medical imaging track [Müller et al., 2009], but they only had two assessors per topic. These numbers are quite low compared to our evaluation where for some topics there were up to 15 assessors. With the high number and the non-professional background of our assessors a statistical measure is needed to rate the reliability and consistency of agreement between the dif- ferent assessments. Several metrics have been developed to measure inter- grader reliability, like mean overlap or Fleiss’ Kappa, which will be briefly presented in the next section. After that we will introduce the three services that are based on the principles of co-words, core journals and centrality of authors. The basic assumptions and concepts are pre- sented. The conducted evaluation and study with 73 par- ticipants is described in the following section. The paper closes with a discussion of the observed results. 2. Inter-rater Agreement The reliability and consistency of agreement between different assessments can be measure by a range of statis- tical metrics. Two of these metrics to measure inter-grader reliability are for example mean overlap or Fleiss’ Kappa. 2.1. Mean Overlap and Overall Agreement In the early TREC conferences (namely till TREC 4) a rather simple method was used to measure the amount of agreement among different assessors by calculating the overlap between the different assessors judgements. Over- lap was defined by the size of the intersection divided by the size of the union of the relevant document sets. Des- pite the relatively high average overlap, Voorhees [2000] reported that there were significant different judgements for some topics. She reported that in TREC topic 219 for example there was no single intersection between two assessors at all. 
The mean overlap for this TREC evaluation was between 0.421 and 0.494 when comparing two assessors directly and 0.301 when comparing the set of all assessors together. This measure was discarded after TREC 4; one of the reasons was that it becomes very unstable when more than three assessors are taken into account.

Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation
Philipp Schaer, Philipp Mayr, Peter Mutschke
GESIS – Leibniz Institute for the Social Sciences, 53113 Bonn, Germany
philipp.schaer@gesis.org

2.2. Fleiss' Kappa
Fleiss' Kappa is a measure of inter-grader reliability – based on Cohen's Kappa – for nominal or binary ratings [Fleiss, 1971]. The Kappa value can be interpreted as the extent to which the observed amount of agreement among raters exceeds what would be expected if all raters made their ratings completely randomly. Fleiss' Kappa is given in equations 1 to 3:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}    (1)

where

\bar{P} = \frac{1}{N n (n-1)} \sum_{i=1}^{N} \sum_{j=1}^{k} n_{ij}\,(n_{ij} - 1)    (2)

\bar{P}_e = \sum_{j=1}^{k} \left( \frac{1}{N n} \sum_{i=1}^{N} n_{ij} \right)^{2}    (3)

N is the total number of subjects (e.g. documents to be assessed); n is the number of judgments per subject (raters); k is the number of response categories; n_{ij} is the number of raters who assigned subject i to category j. Kappa scores can range from 0 (no agreement) to 1.0 (full agreement). Landis and Koch [1977] suggest interpreting the score as follows: κ<0 = poor agreement, 0≤κ<0.2 = slight agreement, 0.2≤κ<0.4 = fair agreement, 0.4≤κ<0.6 = moderate agreement, 0.6≤κ<0.8 = substantial agreement, 0.8≤κ≤1 = (almost) perfect agreement. These interpretations are not generally accepted and other interpretations are possible. Greve and Wentura [1997] suggest interpreting scores of κ<0.4 as "not to be taken too seriously", values of 0.4≤κ<0.6 as acceptable, and 0.75≤κ as good up to excellent.

3. Evaluated System and Services
All proposed models are implemented in a live information system using (1) the Solr search engine, (2) the Grails web framework and (3) the Recommind Mindserver, to demonstrate the general feasibility of the approaches. Solr is an open source search platform from the Apache Lucene project2, which uses a tf-idf-based ranking mechanism3. The Mindserver is a commercial text categorization tool, which was used to generate the query expansion terms. Both Bradfordizing and author centrality as re-ranking mechanisms are implemented as plugins to the open source web framework Grails, which is the glue that combines the different modules and offers an interactive web-based prototype4.

3.1. Query Expansion by Search Term Recommendation
When using search in an information system, a user has to come up with the "correct" terms to formulate his query. These terms have to match the terms used in the documents or in the description of the documents to get an appropriate result. Especially in the domain of metadata-driven Digital Libraries this is long known as the language problem in IR [Petras 2006].

2 http://lucene.apache.org/Solr/
3 http://lucene.apache.org/java/2_4_0/scoring.html
4 http://www.gesis.org/beta/prototypen/irm/

The Search Term Recommender (STR) is based on statistical co-word analysis and builds associations between free terms (i.e. from title or abstract) and controlled terms (i.e. from a thesaurus). Controlled terms are assigned to the document during professional indexing and enrich the available metadata on the document. The co-word analysis implies a semantic association between the free and the controlled terms.
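Referring back to equations (1)–(3) in Section 2.2, Fleiss' Kappa can be computed directly from an N × k table of category counts. The following is a minimal sketch with the same symbols as above; the function name and the toy data are assumptions for illustration only.

    def fleiss_kappa(counts):
        """Fleiss' Kappa for a rating matrix `counts`, where counts[i][j] is the
        number of raters that assigned subject i to category j (equations 1-3).
        All subjects must have the same number of ratings n."""
        N = len(counts)                    # number of subjects
        k = len(counts[0])                 # number of categories
        n = sum(counts[0])                 # ratings per subject
        # mean observed agreement P_bar (equation 2)
        p_bar = sum(sum(c * (c - 1) for c in row) for row in counts) / (N * n * (n - 1))
        # expected agreement P_e from the category proportions (equation 3)
        p_e = sum((sum(row[j] for row in counts) / (N * n)) ** 2 for j in range(k))
        return (p_bar - p_e) / (1 - p_e)   # equation 1

    # Toy example: 3 documents, 5 raters each, binary relevance judgments.
    # print(fleiss_kappa([[5, 0], [3, 2], [0, 5]]))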
The more often terms co-occur in the text the more likely it is that they share a semantic relation. The commercial classification software Recom- mind Mindserver, which is based on Support Vector Ma- chines (SVM) and Probabilistic Latent Semantic Analysis (PLSA), calculated these associations. The software was trained with the SOLIS database to best match the special- ized vocabularies used in the Social Sciences. When used as an automatic query expansion service the Mindserver’s top n=4 suggested terms were taken to en- hance the query. The query was expanded by simply OR- ing them. If we take a sample query on Poverty in Ger- many, it is expanded to: povert* AND german*  (povert* AND german*) OR "poverty" OR "Federal Republic of Germany" OR "so- cial assistance" OR "immiseration" 3.2. Re-Ranking by Bradfordizing The Bradfordizing re-ranking service addresses the prob- lem of oversized result sets by using a bibliometric method. Bradfordizing re-ranks a result set of journal articles according to the frequency of journals in the result set such that articles of core journals – journals which publish frequently on a topic – are ranked higher than articles from other journals. This way the central publica- tion sources for any query are sorted to the top positions of the result set [Mayr, 2009]. Figure 1: Screenshot of the web-based assessment tool. The users were shown the name and description of the task (top) and a number of documents to assess. The documents (middle) had author, publication year, title, abstract and keywords, which the assessors could use to judge the documents relevant or non- relevant (right). 214 In a first step the search results are filtered with their ISSN, since ISSNs are proper identifiers for journals. The next step aggregates all results with the same ISSN. For this step the build-in faceting mechanism of Solr is used. The journal with the highest ISSN facet count gets the top position in the results; the second journal gets the next position, and so on. In the last step, each documents rank (given through Solr’s internal ranking) is boosted by the frequency counts of the journals. This way all documents from the journal with the highest ISSN facet count are sorted to the top. 3.3. Re-Ranking by Author Centrality Author centrality is another way of re-ranking result sets. Here the concept of centrality of authors in a network is an approach for the problem of large and unstructured result sets. The intention behind this ranking model is to make use of knowledge about the interaction and coopera- tion behavior in fields of research. The model assumes that the relevance of a publication increases with the cen- trality of their authors in co-authorship networks. The user is provided with publications of central authors when the result set obtained is re-ranked by author centrality. The re-ranking service calculates a co-authorship net- work based on the result set to a specific query. Centrality of each single author in this network is calculated by ap- plying the betweenness measure and the documents in the result set are ranked according to the betweenness of their authors so that publications with very central authors are ranked higher in the result list. Since the model is based on a network analytical view it differs greatly from text- oriented ranking methods like tf-idf [Mutschke, 2004]. 4. 
Evaluation We conducted a user assessment with 73 information science students who used the SOLIS database5 with 369,397 single documents on Social Science topics to evaluate the performance of the three presented services. The documents include title, abstract, controlled keywords etc. The assessment system, which was built on top of the IRM prototype described earlier and all documents were in German. All written examples in this paper are trans- lated. 4.1. Method A standard approach to evaluate Information Retrieval systems is to do relevance assessments, where – in respect to a defined information need – documents are marked as relevant or not relevant. There are small test collections (like Cranfield) where all containing documents are judged relevant or not relevant. For large collections (like TREC, CLEF etc.) this is not possible, so only subsets are assessed. Only the top n documents returned by the differ- ent retrieval systems are assessed. Pooling is used to dis- guise the origin of the document [Voorhees and Harman, 2005]. In our assessment the participants were given a con- crete search task, which was taken from the CLEF corpus. After a briefing each student had to choose one out of ten different predefined topics (namely CLEF topics 83, 84, 88, 93, 96, 105, 110, 153, 166 and 173). Topic title and the description were presented to form the information need (cp. table 1 and figure 1). 5 SOLIS is accessable via the Sowiport portal http://www.gesis.org/sowiport Fleiss’ Kappa Mean overlap topic title description n N k κ >=0.8 =1 83 Media and War Find documents on the commentatorship of the press and other media from war regions. 15 40 2 0.522 0.727 0.225 84 New Media in Education Find documents reporting on benefits and risks of using new technology such as com- puters or the Internet in schools. 11 40 2 0.304 0.5 0.15 88 Sports in Nazi Germany Find documents about the role of sports in the German Third Reich. 6 40 2 0.528 0.75 0.425 93 Burnout Syndrome Find documents reporting on the burnout syndrome. 10 40 2 0.411 0.8 0.35 96 Costs of Vocational Educa- tion Find documents reporting on the costs and benefits of vocational education. 2 40 2 0.488 0.775 0.775 105 Graduates and Labour Mar- ket Find documents reporting on the job market for university graduates. 5 40 2 0.466 0.675 0.525 110 Suicide of Young People Find documents investigating suicides in teenagers and young adults. 5 40 2 0.222 0.625 0.425 153 Childlessness in Germany Information on the factors for childlessness in Germany 10 40 2 0.202 0.325 0.175 166 Poverty in Germany Research papers and publications on poverty and homelessness in Germany. 9 40 2 0.438 0.5 0.25 173 Propensity towards violence among youths Find reports, cases, empirical studies and analyses on the capacity of adolescents for violence. 10 40 2 0.411 0.55 0.2 avg. 0.4 0.622 0.35 Table 1: Ten CLEF topics and the corresponding inter-grader reliability expressed by Fleiss’ Kappa and mean overlap per topic. The mean overlap is calculated using two different thresholds (>=0.8 means that an intersection rate of 80% is counted and =1 means that only perfect matches are counted). 215 The pool was formed out of the top n=10 ranked docu- ments from each service and the initial tf-idf ranked result set respectively. Duplicates were removed, so that the size of the sample pools was between 34 and 39 documents each. 
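The pooling step just described could look roughly like this. Document identifiers and the service names are only labels; the order of documents inside the pool is not specified in the paper, so insertion order is used here.

    def build_pool(rankings, depth=10):
        """Pool the top `depth` documents of each service's ranking and drop
        duplicates, so assessors cannot tell which service returned a document.

        rankings: dict mapping service name ('SOLR', 'STR', 'BRAD', 'AUTH')
                  to a ranked list of document ids.
        """
        pool, seen = [], set()
        for service, ranked_docs in rankings.items():
            for doc_id in ranked_docs[:depth]:
                if doc_id not in seen:
                    seen.add(doc_id)
                    pool.append(doc_id)
        return pool

    # With four services and depth 10 the pool holds at most 40 documents;
    # duplicates shrink it, matching the reported pool sizes of 34 to 39.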
The assessors could choose to judge a document as relevant or not relevant (binary decision); in case they did not assess a document, it was ignored in later calculations.

4.2 Data Set
The 73 assessors, who were information science students, did 43.78 assessments on average and 3,196 single relevance judgments in total. Only 5 participants did not fill out the assessment form completely. Since every assessor could freely choose from the topics, the assessments are not distributed evenly: topic 83 was picked 15 times, topic 96 twice. It was a conscious decision to allow each assessor to freely choose a topic he or she is familiar with, but this had direct consequences on the number of single assessments (cp. table 2).

5. Results
Since the assessors in this experiment were neither information professionals nor domain experts, the results should be discussed with special attention to the level of agreement between the assessors. To rate the reliability and consistency of agreement between the different assessments we used mean overlap and Fleiss' Kappa, described in section 2. In addition, the precision of each topic/service combination and the intersection of the different result sets were calculated.

5.1. Mean Overlap and Overall Agreement
Normally only a true overlap is counted. When comparing the judgements of two assessors they can either agree or disagree. When dealing with more than two assessors the situation gets difficult: what to do if 14 assessors agree but one disagrees? To take majority decisions into account we applied two different thresholds when calculating the mean overlap: 1 and 0.8. Only perfect matches count when a threshold of 1 is applied, meaning that all assessors have to agree 100%. Here the measured mean overlap (140 intersections where all assessors agreed 100% in a total union set of 400) was 0.35 (cp. table 1). This value can be compared to the reported values from the selected TREC topic (see section 2.1), where the overlap was between 0.421 and 0.494 comparing two assessors directly and 0.301 comparing the set of all three assessors together. The more assessors per topic, the lower the mean overlap gets. This is quite natural since 100% agreement is easier to obtain between 2 assessors (for example for topic 96) than for 15 assessors (as for example in topic 83). Therefore a second threshold of 0.8 was applied, so that judgements with only slight disagreement were also taken into consideration. This had direct implications on the average overlap value over all assessments: the average overlap rate was 0.622.

5.2. Fleiss' Kappa
The equation for Fleiss' Kappa is given in section 2.2. In our study N was always forty (ten documents for each value-added service; assessments for duplicates count for all services that returned them) and k was always two (binary decision). All Kappa scores in our experiment range between 0.20 and 0.52, which are fair up to moderate levels of agreement, or mainly acceptable in the more conservative interpretation (cp. table 1). The average Kappa value was 0.4.

5.3. Precision
The precision P was calculated by equation 4:

P = \frac{|\{rel\} \cap \{ret\}|}{|\{ret\}|}    (4)

P was calculated for each topic and service, where |{rel}| is the number of all relevant assessed documents and |{ret}| is the number of all retrieved documents which were in the pool (relevant and not relevant). All unfiltered precision values and numbers of relevance assessments can be seen in table 2 and figure 2.
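Equation 4 translates directly into code. The sketch below follows one reading of the setup above, in which documents without a judgment are simply left out of the denominator, as stated in Section 4.1; the function name and data layout are assumptions.

    def precision(assessments, retrieved):
        """Equation 4: share of a service's pooled documents judged relevant.

        assessments: dict doc_id -> True/False (relevant / not relevant);
                     documents without a judgment are ignored.
        retrieved:   the documents a service returned that ended up in the pool.
        """
        judged = [doc for doc in retrieved if doc in assessments]
        if not judged:
            return 0.0
        relevant = sum(1 for doc in judged if assessments[doc])
        return relevant / len(judged)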
The average precision of the STR was highest (68%) compared to the baseline from the SOLR system (57%). The two alternative ranking methods Bradfordizing (BRAD) and author centrality (AUTH) scored 57% and 60% respectively. We calculated additional average precision values and left out the topics marked unstable by the mean overlap values (those with a value of <=0.35 in case of a 100% match). Here the topics 83, 84, 153, 166 and 173 were left out. This was not considered useful since only half the 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 83 84 88 93 96 105 110 153 166 173 avg. avg. k. avg. o. AUTH BRAD SOLR STR Figure 2: Precision for each topic and service (Relevance assessments per topic / total amount of single assessments). The three aver- age precision values are (1) avg.: unfiltered values, (2) avg. k.: average computed with a kappa threshold of 0.4 and (3) avg. o.: aver- age computed with a overlap threshold of 0.35. 216 data set remains after this data clearance step. Neverthe- less calculating the average precision for each service without the five topics would have been: AUTH: 63%, BRAD: 63%, SOLR 61% and STR: 65%. Considering the Kappa values the relevance judgments from topics 84, 110 and 153 were below the threshold of 0.40 and so they might have been dropped. Calculating the average precision for each service without the three unstable topics would have been: AUTH: 61%, BRAD: 56%, SOLR 52% and STR: 64% (cp. table 2). 5.4. Intersection of result sets A comparison of the intersection of the relevant top 10 document result sets between each pair of retrieval service shows that the result sets are nearly disjoint. 400 docu- ments (4 services * 10 per service * 10 topics) only had 36 intersections in total (comp. figure 3). Thus, there is no or very little overlap between the sets of relevant top documents obtained from different rankings. AUTH and SOLR as well as AUTH and BRAD have just three relevant documents in common (for all 10 top- ics), and AUTH and STR have only five documents in common. BRAD and SOLR have six, and BRAD and STR have five relevant documents in common. The larg- est, but still low overlap is between SOLR and STR, which have 14 common documents. 6. Discussion The discussion will focus on two different aspects: (1) the results of the inter-rater agreement and our implications to our evaluation setup and (2) the evaluation itself and the outcomes regarding precision and intersection of result sets. 6.1. Inter-rater agreement The central question we have to consider is to what amount the evaluation is significant before we can inter- pret the results in form of precision values. At first our results looked promising, when comparing them to the mean overlap agreement rate from the TREC studies, especially when taking the high number of assessors per topics into account. After we had a look at the mean over- lap and Kappa values we had to differentiate more on the single topics since their performance was not simply comparable. Taking the non-professional assessors and the Kappa values in consideration we have to ask ourselves how reliable our evaluations are. The general setup is not un- usual: Other studies also relied on non-professional asses- sors and showed promising results: Al-Maskari et al. [2008] observed a rather high overall agreement between official TREC and non-TREC assessors in a study with 56 participants and 4399 single document assessments in total. 
The agreement changed due to the different topics and the actual ranking position of the assessed document and was between 75 and 82% if both relevant and irrele- vant documents assessments were counted. Alonso and Mizzaro [2009] compared official TREC assessments with anonymous assessors from Amazon Mechanical Turk. They observed a rather small sample of 10 partici- pants and 290 judgments. For the relevant documents the average across of participants was 91% and in case of not relevant documents the average was 49%. 14 3 6 STR AUTH BRAD 5 5 3 SOLR 0 Figure 3: Intersection of top n=10 documents from all topics and services (total of 400 documents). The number 3 in the lower right circle for example means that only 3 documents in the result set returned by the SOLR system and the result set re- turned by the author centrality service AUTH are the same. non relevant relevant precision topic A U TH B R A D S O LR S TR A U TH B R A D S O LR S TR A U TH B R A D S O LR S TR 83 42 104 36 25 98 38 114 109 0.70 0.27 0.76 0.81 84 71 38 27 26 37 69 77 56 0.34 0.64 0.74 0.68 88 51 22 28 12 9 38 32 48 0.15 0.63 0.53 0.80 93 28 26 25 26 72 74 73 76 0.72 0.74 0.74 0.75 96 3 8 12 13 19 14 8 7 0.86 0.64 0.40 0.35 105 15 18 15 24 28 26 30 20 0.65 0.59 0.67 0.45 110 13 22 15 5 37 28 35 45 0.74 0.56 0.70 0.90 153 42 40 36 32 57 54 62 64 0.58 0.57 0.63 0.67 166 30 39 72 33 56 48 15 54 0.65 0.55 0.17 0.62 173 41 48 57 26 53 46 36 68 0.56 0.49 0.39 0.72 avg. 0.60 0.57 0.57 0.68 avg.(kappa >= 0.4) 0.61 0.56 0.52 0.64 avg. (overlap >= 0.35) 0.63 0.63 0.61 0.65 Table 2: Relevance judgments for each topic and service (total number of non-relevant and relevant judgments) and the calculated precision values (with different thresholds applied). 217 Nevertheless a mandatory step should be to sort out subsets of the assessment data where Kappa values (or other reliable measures) are below a certain threshold. Unfortunately this is not done in all studies. 6.2. Precision and Intersection Comparing the precision values of the different ap- proaches the three services in average all performed better (in case of the filtered topic list) or at least same (when counting all assessments) as the naïve tf-idf baseline. This is true for all average precision values calculated regard- ing all kinds of thresholds and data clearing. When inspected on a per-topic basis the performance is more diverse. While the STR outperforms in nearly all cases the other services had topics where their precision was significantly smaller compared to the other sources. For topic 83 the Bradfordizing re-ranking service couldn’t sort the result set appropriately while the author centrality service couldn’t adequately handle topic 88. Discussing the results of the two proposed re-ranking methods Bradfordizing and author centrality brings up two central insights: (1) users get new result cutouts with other relevant documents which are not listed in the first section (first n=10 documents) of the original list and (2) Bradfordizing and author centrality can be a helpful in- formation service to positively influence the search proc- ess, especially for searchers who are new on a research topic and don’t know the main publication sources or the central actors in a research field. The STR showed an expected behavior: While the result set in total increases the first n=10 hits are more precise. 
This is quite normal in query expansion scenarios where the high descriptive power of the controlled terms that are added to the query increases the precision [Efthimiadis, 1996]. This is an indicator for the high quality of the semantic mapping between the language of scientific discourse (free text in title and abstract) and the language of documentation (controlled thesauri terms). The very low overlap of the result sets as described in section 5.4 confirms that the value-added services pro- posed provide a quite different view to the document space: Not only from a term-centric view proposed by tf- idf (with or without query expansion mechanism) but also from a more person- or journal-centric perspective. Which is even more interesting since the average precision values didn’t differ as much as one might have expected. 7. Outlook After the evaluation and the analysis regarding inter-rater agreement two important implications emerge: (1) the inter-rater agreement rates were mainly fair to moderate and therefore showed a general feasibility of a non- experts evaluation and (2) after a data-cleaning step which erased the assessments with very poor agreement rates the evaluation data showed that the three retrieval services returned disjoint but still relevant result sets. The services provide a particular view to the information space that is quite different from traditional retrieval methods. Although the results looked promising our next step is a new relevance assessment with scientist and domain ex- perts to have a direct comparison and to reinsure our ob- servations. Acknowledgments Vivien Petras and Philipp Mayr supervised the 73 students in their courses at Humboldt University and University of Applied Science Darmstadt. Hasan Bas implemented the assessment tool during his internship with GESIS. This work was funded by DFG (grant number INST 658/6-1). References [Alonso and Mizarro, 2009] Omar Alonso and Stefano Mizzaro. Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. Proceedings of the SIGIR 2009 Workshop on The Future of IR Evaluation : 15-16, 2009. [Al-Maskari et al., 2008] Azzah Al-Maskari, Mark Sand- erson, and Paul Clough. Relevance judgments between TREC and Non-TREC assessors. Proceedings of the 31st annual international ACM SIGIR, 2008. [Bates, 1990] Marcia J. Bates. Where Should the Person Stop and the Information Search Interface Start? Infor- mation Processing & Management, 26 575-591, 1990. [Efthimiadis, 1996] Efthimis N. Efthimiadis. Query ex- pansion. Annual Review of Information Systems and Technology (ARIST), Vol. 31, 121-187, 1996. [Ferro and Carol, 2009] Nicola Ferro and Carol Peters. CLEF 2009 Ad Hoc Track Overview: TEL & Persian Tasks. Working Notes for the CLEF 2009 Workshop, Corfu, Greece, 2009. [Fleiss, 1971] Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382, 1971. [Greve and Wentura, 1997] Werner Greve and Dirk Wen- tura. Wissenschaftliche Beobachtung: Eine Einführung. Weinheim, PVU/Beltz, 1997. [Landis and Koch, 1977] J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174, 1977. [Mayr et al., 2008] Philipp Mayr, Peter Mutschke, and Vivien Petras. Reducing semantic complexity in dis- tributed digital libraries: Treatment of term vagueness and document re-ranking. Library Review, 57(3), 213- 224, 2008. [Mayr, 2009] Philipp Mayr. 
Re-Ranking auf Basis von Bradfordizing für die verteilte Suche in Digitalen Bi- bliotheken. Humboldt-Universität zu Berlin, Berlin, 2009. [Mutschke, 2004] Peter Mutschke. Autorennetzwerke: Netzwerkanalyse als Mehrwertdienst für Informations- systeme. In: Information zwischen Kultur und Markt- wirtschaft (ISI 2004), 141-162, 2004. [Müller et al., 2009] Henning Müller, Jayashree Kal- pathy–Cramer, Ivan Eggel, Steven Bedrick, Saïd Rad- houani, Brian Bakke Jr., Charles E. Kahn, and William Hersh. Overview of the CLEF 2009 Medical Image Re- trieval Track. Working Notes for the CLEF 2009 Work- shop, Corfu, Greece, 2009. 218 [Petras, 2006] Vivien Petras. Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages. University of California, Berkley, 2006. [Voorhees, 2000] Ellen M. Voorhees. Variations in rele- vance judgments and the measurement of retrieval ef- fectiveness. Information Processing & Management. 36(5): 697-716, 2000. [Voorhees and Harman, 2005] Ellen M. Voorhees and Donna K. Harman (Eds.). TREC: Experiment and Evaluation in Information Retrieval: The MIT Press, 2005. 219 Abstract To explore how Arab Internet users can find the information in their mother tongue on the web, the five web search engines Araby, Ayna, Google, MSN and Yahoo were tested on an in- formation retrieval evaluation basis with the con- sideration of the web-specific evaluation re- quirements. The test used fifty randomly selected queries from the top searches on the Arabic search engine Araby. The relevance of the top ten results and their descriptions retrieved by each search engine for each query were evaluated by independent jurors. Evaluations of results and descriptions were then compared to assess their conformity. The core finding was that Google performed almost all the times better than the other engines. The difference to Yahoo was however not statically significant, and the differ- ence to the third ranked engine MSN was signifi- cant to a low degree. The Arabic search engine Araby showed performance on most of the evaluation measures, while Ayna was far behind all other search engines. The other finding was the big differences between search results and their descriptions for all tested engines. 1 Introduction This work represents an evaluation test performed on mul- tiple web search engines which can deal with Arabic and the specific needs of this language and its speakers as the target users group considered in the test. The question this test attempts to answer is: which web search engine of the five tested in this study can retrieve the “best” search results for the user judging these results? The test compares the effectiveness of five different web search engines, two of which are native Arabic engines: Araby and Ayna, and three are international Arabic- enabled engines Google, MSN and Yahoo. The motivation of the test is the lack of evaluative re- search of Arabic information retrieval systems, especially on the internet. This is despite the very high growth rates of internet users in the Arab countries, and that most users can not read English which dominates the content on the World Wide Web [Hammo, 2009]. 2 Related Work The evaluation of web search engines has been gaining increased importance and research interest since its early beginnings in 1996. A large number of evaluation experi- ments has been performed to assess the performance of search engines from different perspectives using varied evaluation measures and test designs. 
Most evaluation tests used search queries in English as the dominant language on the web. However, many tests in other languages have also been carried out, focusing mainly on the way search engines deal with different languages, their linguistic issues and suitable search algorithms, with the aim of improving the multilingual capabilities of search engines. Other studies, like this work, focus on the performance evaluation of web search engines from the local users' point of view. An overview of non-English web information retrieval studies is given by Lazarinis and others [Lazarinis, 2007; Lazarinis et al., 2009].
Several studies (such as [Moukdad and Large, 2001], [Moukdad, 2002; 2004], [Abdelali et al., 2004] and recently [Hammo, 2009]) discussed Arabic information retrieval on the web.
Gordon and Pathak [1999] discussed a collection of web search engine evaluation tests conducted since 1996 with respect to their test methodology and purpose. Another overview, aligned with the recommendations of Tague-Sutcliffe [1992] for general information retrieval evaluation and of Hawking et al. [2001] for web-specific information retrieval evaluation criteria, is presented in [Lewandowski, 2008].
The two most recent studies similar to this work in their methodology and design are the following. Griesbaum tested three search engines (Google, Lycos and AltaVista) using 50 queries in German [Griesbaum, 2004]. Google came first in the overall performance judgment, followed by Lycos with no significant difference; AltaVista came last with a larger difference to Google, but not to Lycos. The second study was conducted by Lewandowski [2008], who evaluated five search engines (Google, Yahoo, MSN, Ask.com and Seekport). A set of forty queries in German was created by forty faculty students, who also acted as jurors. The study found no significant reason to favor any of the major search engines in terms of performance and concluded that search engine companies should pay more attention to the quality of results descriptions.

3 Test Methodology
Tague-Sutcliffe [1992] presented a guide for information retrieval evaluation that helps experimenters make the required decisions while planning an evaluation test, in order to ensure the validity of the experiment, the reliability of the results and the efficiency of the test procedures. It proposes ten questions to be answered, which are discussed here for this study.

3.1 Need for testing
Keeping in mind the limited knowledge of foreign languages, especially English, among Arab internet users, many questions can be raised about the effectiveness and efficiency of searching the web in Arabic using local Arabic search engines or international Arabic-enabled ones. While these questions can cover the search systems themselves (search algorithms, language handling issues, resource consumption, etc.), another aspect can also be questioned, namely the informativeness and effectiveness of web search engines in the eyes of Arab internet users. To the best of the authors' knowledge, no similar evaluation test has been published before this work.

3.2 Type of test
The test evaluates five commercial online search engines for the Arabic language, working on the web as an operational database.
This kind of real-use test differs from experimental tests performed in a laboratory environment based on the Cranfield paradigm [Mandl, 2008]. Such laboratory tests allow a higher level of control over test parameters and variables, whereas the test presented here gives an assessment that is closer to real life and to the users' perception of results.

3.3 Variables definition
The independent variables that affect the results of this test are: evaluation criteria, relevance performance measures, queries, information needs and participants (also referred to as users or jurors). These variables are defined in this section to illustrate the general settings of the test. The dependent variables are the relevance judgments of the jurors, which are the indicator of retrieval performance in this test. They are discussed in the results section.

Evaluation criteria
As in most information retrieval tests, the main evaluation criterion for search results is relevance. This should measure the ability of the tested web search engines to satisfy the users' information needs described in their search queries. Relevance assessment is a problematic issue discussed in several studies and can be influenced by many factors [Schamber, 1994]. A representative assessment can, however, be obtained from individual jurors in order to avoid bias of the researcher in the judgment process.
Search results are evaluated in this test on a binary basis as "relevant" or "not relevant". Following [Lewandowski, 2008], a search result should satisfy the user's information need without requiring further actions. Results descriptions are also evaluated on a binary basis as "seems relevant" or "seems not relevant". To cover the possibility that a search engine presents a result without a description in its results list, the evaluation option "no description available" is also given to the jurors. As this test is specifically designed for Arabic search evaluation and targets Arab internet users with information needs in their native language, all retrieved documents in languages other than Arabic had to be evaluated as "not relevant". Jurors were instructed beforehand to judge non-Arabic results descriptions as "seems not relevant" as well.

Relevance performance measures
Precision is a standard information retrieval evaluation measure and is used in this test. The other standard evaluation measure, recall, which is used to evaluate the performance of classical information retrieval systems, cannot easily be applied to web search engine evaluations, since the total number of relevant documents cannot be estimated. As most internet users usually look only at the first one or two results of a query, a cut-off at the first ten results can give reliable evaluation results using the so-called top-ten precision. Precision values are calculated on both the macro and the micro level [Womser-Hacker, 1989]; a small computational sketch of these two measures is given below, after the description of the query set.

Queries
To be as close as possible to the real-life search behavior of Arab web users, a random set of search queries was selected from the most used search queries on the Arabic web search engine Araby.com. A collection of fifty queries (a standard for TREC evaluation tests) is a reasonable amount for valid evaluation results. An additional ten queries were reserved for any problems that might occur during the test. Queries were selected and executed exactly as typed by the original users (as listed in the search engine Araby on 10 March 2009). No corrections or alternative writing methods were suggested.
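The following minimal Python sketch illustrates one plausible reading of the top-ten precision measures described above. It is not the authors' tooling: the data layout (a list of binary top-ten judgments per query) and the function names are illustrative assumptions, and the authoritative definitions of macro- and micro-precision are those of [Womser-Hacker, 1989].

# Minimal sketch of the top-ten precision measures described in Section 3.3,
# assuming binary judgments (1 = relevant, 0 = not relevant / not found) for
# the first ten results of each query; names and data layout are illustrative.

def precision_at_k(judgments, k=10):
    """Precision of a single result list, cut off at rank k."""
    top = judgments[:k]
    return sum(top) / float(k) if top else 0.0

def macro_precision(per_query_judgments, k=10):
    """Average of the per-query precision values (every query counts equally)."""
    values = [precision_at_k(j, k) for j in per_query_judgments]
    return sum(values) / len(values)

def micro_precision(per_query_judgments, k=10):
    """Relevant results pooled over all queries, divided by all judged results."""
    relevant = sum(sum(j[:k]) for j in per_query_judgments)
    judged = sum(min(len(j), k) for j in per_query_judgments)
    return relevant / float(judged)

if __name__ == "__main__":
    # Two toy queries with binary top-ten judgments.
    engine_judgments = [
        [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],  # query 1: 4 relevant in top ten
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # query 2: 1 relevant in top ten
    ]
    print("macro:", macro_precision(engine_judgments))  # (0.4 + 0.1) / 2 = 0.25
    print("micro:", micro_precision(engine_judgments))  # 5 / 20 = 0.25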
Information needs
After selecting the random set of search queries, the information needs behind these queries had to be reconstructed in order to simulate the needs of the users who originally entered these queries into the search engine and to form the relevance judgment criteria. This task is particularly difficult for general, short queries. A group of Arab internet users (mainly students and engineers) were asked to describe their information needs when searching for five given queries each. All descriptions for each query were then merged to form the relevance judgment criteria. For the search query "Sayings", for example, relevant documents should contain "Sayings of elders, politicians or celebrities".

Participants
Participants in this test had to be native Arabic speakers with average knowledge of internet browsing and the usage of web search engines. A total of seventy volunteers (53 males and 17 females) filled out the information needs reconstruction forms. To avoid bias in the information needs simulation, users of different ages and educational backgrounds described the information requirements they would associate with the given search queries. The ideal number of evaluation jurors for fifty queries is fifty, so that each juror can evaluate a single query on all tested search engines. Out of eighty-nine invited users (friends and colleagues of the first author), a total of fifty jurors (42 males and 8 females) from nine Arab countries was reached.

3.4 Search engines selection
According to the recommendations in [Hawking et al., 2001] for the evaluation of web search engines, the major search engines should be included in the test. As this work tries to explore the suitability of native Arabic web search engines as alternatives to the international market-leading engines for Arab users, local search engines were tested in addition to the leading international engines. The five most popular search engines in the Arab countries according to Alexa were selected; two of them are native Arabic engines and three are international Arabic-enabled engines. The selected search engines are:
- Araby (www.araby.com)
- Ayna (www.ayna.com)
- Google (www.google.com.sa)
- MSN (www.live.com; officially replaced on 3 June 2009 by Microsoft's new search service, www.bing.com)
- Yahoo! (www.yahoo.com)

3.5 Finding queries
Although there are no published statistics on query length and complexity for Arabic web searches, the most searched queries on the Arabic search engine Araby showed that Arab users conform to other internet users in using very short and rather unspecific search queries [Jansen et al., 2000]. Search operators (e.g. Boolean operators) were not used in the search queries.

3.6 Processing queries
To collect search results from all tested search engines at almost the same time, the queries were processed on a single day, one query at a time on all engines, with a minimal time interval. This eliminates the possibility of index changes in the tested search engines while a single query is being processed, which could give one engine an advantage over the others. The results lists were then saved as HTML pages on a local drive.

3.7 Experimental design
The experimental design of this test is based on the repeated-measures design presented in [Tague-Sutcliffe, 1992] and used in [Griesbaum, 2004].
The jurors had to evaluate the top ten search results (ranks one to ten) for a single query as presented by each web search engine, without knowing the source of the results, in order to avoid bias caused by users' preferences for a particular search engine they are familiar with. The second task was to evaluate the search results descriptions for a single, different query on the five tested engines. In this part the sources of the results were known to the jurors, as they evaluated the descriptions on the locally saved results pages, which are identical to the original results pages delivered by the search engines when the queries were executed.

3.8 Data collection
The initial design was to collect data from the jurors in a laboratory environment on printed evaluation forms. This design, however, ran into difficulties and was replaced by an online survey design, as detailed later in the "Pre-Test" section. Data was collected using an online survey service in digital formats that enable different analyses.

3.9 Data analysis
To obtain a binary relevance judgment, documents that could not be found were added to the "not relevant" documents in the results evaluation calculations. Results with no descriptions were likewise considered "seems not relevant". Using the collected data, the performance of the five tested search engines was evaluated based on the top-ten precision. Macro- and micro-precision for the top ten search results were calculated to evaluate the retrieval performance. Micro-precision values were also calculated for the top ten results descriptions in order to analyze the conformity of search results and their descriptions by comparing these values and applying the measures presented in [Lewandowski, 2008].

3.10 Presenting results
The test motivation, design and methodology were detailed in the previous sections; the test results are analyzed in a dedicated section, and the conclusions section gives a summary of the conducted research and future directions for research based on this work. The complete work was submitted by the first author as a Master thesis at the University of Hildesheim.

4 Performing the test
4.1 Pre-Test
To examine the initial test design, a pre-test was conducted on 3 April 2009, in which six participants executed searches for given queries (one per juror) on the five tested search engines. The search results pages were recorded, and the users judged them first based on the results descriptions and subsequently based on the full result documents. The judgments were given on a printed evaluation form. This design, however, faced the following main problems:
- It was extremely difficult to plan the test timing to suit all users, who did not participate as part of a university course or a job task.
- The pre-test users found the test tasks complicated, confusing and tiring.
- An additional fatigue effect surfaced as a result of the unreliable internet connection at the test location.
All these problems showed that a laboratory test would not be useful, or could not be conducted at all, at the time and place initially planned. To avoid the disadvantages of the test location, the alternative solution was to involve Arab jurors geographically distributed over multiple countries by performing the test online, as shown in the next section.

4.2 Test
All queries were processed on 11 April 2009 in Germany. Results lists and results were saved locally for documentation and prepared on extra web pages for the evaluation process; a hypothetical sketch of this collection step is given below.
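The paper does not describe the tooling used for this collection step, so the following sketch is purely illustrative: it assumes one URL template per engine (the placeholder templates, file naming scheme and the use of the requests library are all assumptions, not the authors' setup). It only shows the general idea of submitting each query to every engine in quick succession and storing the returned result page as local HTML for later offline assessment.

# Hypothetical sketch of the collection step in Sections 3.6 and 4.2.
import pathlib
import urllib.parse

import requests

# Placeholder templates; the real engine URLs and parameters are not given here.
ENGINE_URLS = {
    "engine_a": "http://example.org/search?q={query}",
    "engine_b": "http://example.net/search?query={query}",
}

def save_result_pages(queries, out_dir="results"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for qid, query in enumerate(queries, start=1):
        encoded = urllib.parse.quote(query)  # keep Arabic queries URL-safe
        for engine, template in ENGINE_URLS.items():
            url = template.format(query=encoded)
            html = requests.get(url, timeout=30).text
            # One file per (query, engine) pair for later offline assessment.
            (out / f"q{qid:02d}_{engine}.html").write_text(html, encoding="utf-8")

if __name__ == "__main__":
    save_result_pages(["مثال"])  # a single example query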
The jurors only had to visit the given links and evaluate the delivered pages digitally on the provided online form. The collection of responses via the online surveys was open from 14 April to 12 May 2009. This long period of time can cause variations in the evaluation process due to the highly dynamic nature of the web; it was, however, needed to allow the large number of jurors to find a suitable time slot in their specific location. This effect could be avoided by obliging the participants to work on a certain date, which, however, raises questions about the users' motivation. The collected digital data was relatively easy to analyze and process.

5 Results
5.1 Number of relevant result documents
The first information that can be obtained about the tested search engines from the evaluation data is the number of relevant result documents; this is displayed in Figure 1.
Figure 1: Number of relevant result documents
Google retrieved the largest number of relevant documents, followed by Yahoo and then MSN. The native Arabic search engine Araby came in fourth place with a clear distance to Google. A large gap was found between these four search engines and the other Arabic search engine, Ayna, which came in last place. Another finding from this figure is that both Arabic search engines retrieved a high proportion of absolutely "not relevant" documents (documents judged as "not relevant", excluding the "not found" documents).
MSN delivered the smallest share of search results pointing to lost documents (dead links), with 8.2% of its results, followed by Google with 10.2% and then Araby, which with 10.6% was better than Yahoo with 11.8% of dead links in its results lists. Again, Ayna came in last place with more than double the ratio of all other engines. These numbers give an idea, to a certain extent, of how up to date the search engines' indices are, but they can also be influenced by many factors.

5.2 Number of relevant descriptions
The number of documents among the top ten hits that seemed to be relevant according to their snippets is shown for the five search engines in Figure 2.
Figure 2: Number of relevant results descriptions
The best judged descriptions were those from Google, which delivered the highest ratio of relevant descriptions (69.2%). Close behind Google was MSN with 68.4%, then Yahoo with 66.4% of positively judged descriptions. A clear distance separated these from Araby with 51.4% and Ayna with 43.4%.
The proportion of 40.6% irrelevant descriptions in the results lists of Ayna means that a user may ignore up to this share of the top ten search results because of their descriptions. 38.8% of the descriptions delivered by Araby also gave a negative impression of the results they described. With 26.8% for MSN, 26.2% for Yahoo and 25.4% for Google, the international search engines gave users less occasion to bypass results at first glance based on their descriptions. MSN described the largest share of its delivered results; only 4.8% had no description. Google similarly failed to deliver descriptions for 5.4% of the presented results, and the proportion was higher for Yahoo at 7.4%. Even with 9.8% of results without descriptions, Araby was better than Ayna, which delivered 16% of its top ten search results without any description.

5.3 Descriptions-Results conformity
To evaluate how well a search engine formulates results descriptions, a comparison between results and their descriptions is needed.
Figure 3: Descriptions-results conformity
The comparison displayed in Figure 3 adds not-found results to the "not relevant" ones and unavailable descriptions to those that "seem not relevant". It shows that all tested search engines presented more relevant descriptions than actually relevant results. Although the number of relevant descriptions delivered by Araby was lower than that of the three international search engines, Araby had the lowest difference of 7.8% between the counts of relevant descriptions and relevant results. For Google, there were 9.7% more relevant snippets than relevant documents, 9.9% for Yahoo and 14.8% for MSN. Ayna exhibited the largest difference, 28%, between the number of relevant descriptions and the number of relevant results.
As the relevance judgments of results and of results descriptions for each query were made by two different jurors, these results can be influenced, as discussed in [Griesbaum, 2004], by formal and contextual variations in the presentation of the descriptions and by preference factors. A high number of relevant descriptions does not necessarily mean that they correctly describe the real results and that users could rely on these descriptions to visit relevant results and avoid irrelevant ones. Figure 4 shows the number of documents for which the relevance judgments based on the snippet and on the full document were equal or different. A high judgment consistency of description-result pairs indicates well-formed search results descriptions for both "relevant" and "not relevant" results.
Figure 4: Description-result pairs judgment comparison
The two native Arabic search engines were equally the best in this regard, presenting the highest ratio of consistent description-result pairs: 62.6% of the top ten descriptions delivered by each of Araby and Ayna were judged identically to their respective results. The second conformity level was achieved by Google with 61.8% of the presented description-result pairs, followed by MSN with 59.6% and lastly Yahoo with 57.6%. Another output of this figure is how frequently the tested search engines present irrelevant results with descriptions that suggest to the user that they could be relevant. 75.4% of Ayna's non-matching descriptions gave a better image of the results than they deserved, followed by MSN with a close frequency of 71.8%, then Google with 65.9%, close to Yahoo with 64.2%. Araby provided the fewest descriptions that guided users to results of a different relevance than suggested, with 59.9% of its total inconsistent descriptions.

5.4 Results mean average precision (micro-precision)
The recall/precision graph plotted in Figure 5 shows the average precision values for each search engine at the respective rank of the top ten results.
Figure 5: Precision graph for the top ten hits
This describes how relevant results were distributed in the top ten results lists.
Google achieved the best values at the top three result ranks, which are usually the ones most viewed by users [Jansen et al., 2000]. 76% of the results delivered by Google at the first ranking position over all queries were judged as relevant, while the average over the top ten results was 57% (rank 10). The next search engine on the top three ranks was Yahoo, with 70% at the first rank and a precision closer to Google at the last rank with 54.4%. Then came MSN, which reached 62% at the first rank and 50.8% at the last. The Arabic search engine Araby reached precision values not far from MSN, especially at the first rank, with 60% relevant results at the first position of the results list for the fifty queries. However, the later ranks showed larger differences, especially compared to Google. The overall precision at the tenth rank for Araby was 44%. Ayna delivered results at the top of its results lists with lower precision than the results at the last rank of all other engines. The average precision of the first-rank results over fifty queries was only 40%. Ayna exhibited a drop in precision after the second position; the precision at the tenth position was merely 24.4%. The mean average precision values of the top one to ten results for the five tested search engines are shown in Table 1.

Search Engine   Mean Average Precision
Araby           0.49
Ayna            0.30
Google          0.65
MSN             0.55
Yahoo           0.60
Table 1: Mean average precision

Although Araby performed much better than Ayna, neither of the two Arabic search engines reached an acceptable precision value of 50%, whereas all other engines stayed above this value even at the last ranking positions.

5.5 Answering queries (macro-precision)
To explore which search engine dealt best with each of the fifty queries used in the test, the macro-precision is observed. The precision values of all tested search engines for every single query are compared and the engines are ranked accordingly. Search engines with equal precision values for the same query are ranked equally to avoid preferences. The sum of these rankings gives an overall view of macro-precision performance. The ranking frequencies are displayed in Figure 6; the numbers show how many times each search engine occupied which ranking place in the comparison.
Figure 6: Answering queries (macro-precision)
The interpretation of the results reveals the superior performance of Google, which was ranked in first place 25 times. This means that Google achieved the best top-ten precision for 50% of all processed queries. Moreover, it was in the first two ranks for 74% of these queries and had the worst precision for only one single query.
Yahoo was ranked equally often at the first and second places, for 38% of the processed queries at each rank, but was never in last place. MSN followed with its precision performance distributed equally over the first three rankings, 16 times at each rank. Araby was the best search engine for only 8 queries and performed at the second and third levels for 64% of the processed queries; it was the worst engine for two queries. Ayna, on the other hand, was mostly at the third and fourth places and reached first place for only one query.

5.6 Number of answered queries
To compare the degree to which each search engine can be helpful to the user, the number of answered queries (queries with at least one retrieved result judged as relevant) is calculated. Table 2 shows that Google and Yahoo answered all processed queries by retrieving at least one relevant result in the top ten results list. The top ten results from MSN for the two queries "Visual illusion" and "Arabic language" included no relevant results.

Search Engine   Answered queries   Not answered queries
Araby           46                 4
Ayna            39                 11
Google          50                 0
MSN             48                 2
Yahoo           50                 0
Table 2: Number of answered queries

Araby could not answer the four queries "Olympiad", "Obama", "Visual illusion" and "Sayings". Although these queries may give the impression that the search engine was of no use to the users who entered these unanswered queries (considering that they only see the top ten results), the findings could be influenced by the subjective judgment, and their general validity may be limited. The results of Ayna seemed, however, clearly disappointing, as it performed the worst with 22% unanswered queries.

5.7 Number of retrieved documents
Although a detailed estimation of the index sizes of the tested search engines is not within the scope of this work, a general idea about these indices can be obtained by analyzing the number of retrieved documents reported by the engines when processing the queries. The average result counts delivered by each search engine for all queries, classified by the number of query terms, are displayed in Figure 7.
Ayna reported the largest number of results for every search query, even though it performed the worst on all previous evaluation measures. For the search query "Newspapers", for example, Ayna delivered over 33 million results with a top-ten precision value of 0 (i.e. the query was not answered). This calls the indexing method and the retrieval algorithm of this search engine into question, as presenting millions of irrelevant results can be a sign of a fundamental index problem or an improper search algorithm. Yahoo and Google delivered a large number of results in comparison to Araby and MSN.
Figure 7: Mean number of retrieved results
The decrease in the average result count for multi-term queries is clear for all search engines except Ayna, which showed no consistent behavior.

5.8 Descriptions mean average precision
Search results descriptions are evaluated in this work because of their importance for the user's decision to visit a retrieved result. The recall/precision graph of the top ten results descriptions at each result ranking is shown in Figure 8.
Figure 8: Precision graph for the top ten descriptions
All tested search engines reached better precision values for descriptions than for the results themselves (Figure 5) at all cut-off values, except for Google at the first two rankings, where the average precision of the search results was higher than the average for their descriptions. To compare the overall precision performance of the results descriptions, the mean average precision for the top one to ten results descriptions of the five search engines is displayed in Table 3.

Search Engine   Mean Average Precision
Araby           0.49
Ayna            0.30
Google          0.65
MSN             0.55
Yahoo           0.60
Table 3: Mean average precision for the top one to ten results descriptions

5.9 Descriptions-Results comparison
To explore the differences in the search engines' performance in retrieving search results and presenting these results, the evaluations of results and descriptions are compared using the measures introduced by Lewandowski [2008].

Mean distance deviance
The distance deviance DRdist_n shows how the precision of the search results varies from the precision of the results descriptions, where n is the number of results or descriptions observed. The following table shows the mean values:

Search engine   DRdist10
Araby           0.09
Ayna            0.15
Google          0.08
MSN             0.16
Yahoo           0.10
Table 4: Mean distance deviance of top ten descriptions and results

MSN and Ayna showed the largest difference between description and result precision, followed by Yahoo and Araby. Google had the lowest average precision difference. We compare the individual description-result pairs on the basis of the absolute evaluation counts for results and descriptions displayed in Table 5.

Description    Result             Araby  Ayna  Google  MSN  Yahoo
Relevant       Relevant (a)       146    77    221     198  197
Relevant       Not relevant (b)   113    124   127     146  137
Not relevant   Relevant (c)       76     123   66      58   77
Not relevant   Not relevant (d)   165    158   86      98   89
Total number of documents (e)     500    500   500     500  500
Table 5: Individual evaluation counts for description-result pairs

Dividing the pair counts (a, b, c, d) by the total number of documents (e), the description-result comparison measures can be calculated as shown in Table 6.

Comparison measure                      Araby  Ayna  Google  MSN   Yahoo
Description-result precision (a/e)      0.29   0.15  0.44    0.40  0.39
Description-result conformance (a+d)/e  0.62   0.47  0.61    0.60  0.57
Description fallout (c/e)               0.15   0.25  0.13    0.12  0.15
Description deception (b/e)             0.33   0.32  0.17    0.20  0.18
Table 6: Description-result comparison measures

The best case is when the search engine delivers relevant documents with descriptions that make them appear relevant to the user.
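A small Python sketch may clarify how the measures of Table 6 follow from the counts of Table 5. It assumes one (description_relevant, result_relevant) boolean pair per judged document; the variant of the distance deviance given at the end (mean absolute gap between the two per-rank precision curves) is only an assumed reading, since the exact definition of DRdist_n is given in [Lewandowski, 2008] and not reproduced in the text.

# Sketch of the description-result comparison measures from Tables 5 and 6.
def comparison_measures(pairs):
    e = len(pairs)
    a = sum(1 for d, r in pairs if d and r)           # relevant description, relevant result
    b = sum(1 for d, r in pairs if d and not r)       # "deceptive" description
    c = sum(1 for d, r in pairs if not d and r)       # relevant result hidden by its description
    d_ = sum(1 for d, r in pairs if not d and not r)  # both judged not relevant
    return {
        "description-result precision": a / e,
        "description-result conformance": (a + d_) / e,
        "description fallout": c / e,
        "description deception": b / e,
    }

def mean_distance_deviance(desc_precision_at_rank, result_precision_at_rank):
    """Assumed reading of DRdist_n: mean absolute gap between the two precision curves."""
    gaps = [abs(dp - rp) for dp, rp in zip(desc_precision_at_rank, result_precision_at_rank)]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    toy_pairs = [(True, True), (True, False), (False, True), (False, False)]
    # precision 0.25, conformance 0.5, fallout 0.25, deception 0.25 for this toy input
    print(comparison_measures(toy_pairs))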
Description-result precision
Google had the highest description-result precision (super precision), followed by MSN and then Yahoo. Araby followed with a clear gap, while Ayna was far behind all other engines. The recall/precision graph for relevant results described by relevant descriptions is plotted in Figure 9.
Figure 9: Precision graph for the top ten relevant results and descriptions
Yahoo performed best at the first rank, for example, while Google led with a clear difference at the second rank. MSN stayed close to the other two engines at all ranks. Araby's precision was clearly lower than that of the three international search engines, and Ayna, which performed better at the second rank than at the first, was far behind all other engines.

Description-result conformance
From Table 6 one can see that Araby performed best by delivering the highest share of "correctly" described results, followed by Google and MSN with small differences; then came Yahoo, followed by Ayna in last place, which described less than half of its delivered results correctly.

Description fallout
Considering that a user may not visit a search result if its description seems irrelevant, the description fallout measures the chance of missing relevant results because of their descriptions. Most of the tested search engines performed very similarly in this regard, except for Ayna, which described 25% of its relevant search results with seemingly irrelevant descriptions.

Description deception
A high value of description deception indicates that the search engine does not provide proper descriptions for irrelevant retrieved results, which may frustrate users who feel misled by the search engine. Google performed best in this regard by providing the smallest share (17%) of irrelevant results associated with descriptions that made them look relevant. The next search engines were Yahoo and MSN with close values (18% and 20%, respectively). Clearly, both Arabic search engines performed worse, with 32% of results with misleading descriptions for Ayna and 33% for Araby.

6 Conclusions
As an overall result, it can be concluded that Google was the best search engine on all measures except for the number of not-found documents, the number of results with no descriptions, the description fallout and the conformance of results and descriptions. MSN and Yahoo exchanged the second and third ranking places on most evaluation measures, except for the number of answered queries, where they performed equally, and for the number of not-found documents and the conformance of results and descriptions, where Yahoo fell back to the fourth rank. Moreover, MSN performed best in terms of the number of not-found documents, the number of results with no descriptions and the description fallout. Although Araby was mostly in the penultimate place, it showed no significant precision difference to MSN, delivered the best conformance of results and their descriptions, performed better than Yahoo in terms of the number of not-found documents and was equal to it in description fallout.
The only last place given to Araby was in description deception. The underperformance of Ayna was a remarkable warning sign. The search engine with the large promotion campaign seemed to suffer from very serious problems in both its indexing and its searching algorithms and would obviously need substantial improvement.
This test found that there is mostly no significant reason to prefer Google over Yahoo in terms of search performance in the Arabic language. One should, however, keep in mind that Yahoo did not offer an Arabic interface at the time of the test, which can affect its acceptance. Arab users may also still consider MSN as a potential alternative search engine, especially when they are interested in particular performance aspects. The more important finding of the test is that the two tested native Arabic search engines could not prove their ability to compete as a local alternative to international search services. Even where Araby achieved some good results, wide room for improvement still exists. The results of this work can serve as a state of the art for further evaluations and research on Arabic search engines, particularly in the absence of similar published studies for this language.

References
[Abdelali et al., 2004] Ahmed Abdelali, Jim Cowie, and Hamdy S. Soliman. Arabic Information Retrieval Perspectives. In: Proceedings of JEP-TALN 2004 Arabic Language Processing, pages 19-22, April 2004.
[Gordon and Pathak, 1999] Michael Gordon and Praveen Pathak. Finding Information on the World Wide Web: the Retrieval Effectiveness of Search Engines. Information Processing & Management, 35:141-180.
[Griesbaum, 2004] Joachim Griesbaum. Evaluation of three German search engines: Altavista.de, Google.de and Lycos.de. In: Information Research, 9(4). (Accessed 10.08.09)
[Hammo, 2009] Bassam H. Hammo. Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. In: Information Retrieval, 12(3):300-323, June 2009, Springer, Netherlands.
[Hawking et al., 2001] David Hawking, Nick Craswell, Peter Bailey and Kathleen Griffiths. Measuring Search Engine Quality. In: Information Retrieval, 4(1):33-95, Springer, Netherlands.
[Jansen et al., 2000] Bernard J. Jansen, Amanda Spink and Tefko Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the Web. In: Information Processing & Management, 36(2):207-227.
[Lazarinis, 2007] Fotis Lazarinis. Web retrieval systems and the Greek language: do they have an understanding? In: Journal of Information Science, 33(5):622-636, Sage Publications Inc.
[Lazarinis et al., 2009] Fotis Lazarinis, Jesus Vilares, John Tait and Efthimis Efthimiadis. Current research issues and trends in non-English Web searching. In: Information Retrieval, 12(3):230-250, June 2009.
[Lewandowski, 2008] Dirk Lewandowski. The Retrieval Effectiveness of Web Search Engines: Considering Results Descriptions. In: Journal of Documentation, 64(6):915-937.
[Lewandowski and Höchstötter, 2008] Dirk Lewandowski and Nadine Höchstötter. Web Searching: A Quality Measurement Perspective. In: A. Spink and M. Zimmer (Editors): Web Searching: Multidisciplinary Perspectives, pages 309-340, Springer, Berlin.
[Mandl, 2008] Thomas Mandl. Recent Developments in the Evaluation of Information Retrieval Systems: Moving Toward Diversity and Practical Applications. In: Informatica - An International Journal of Computing and Informatics, 32:27-38.
[Moukdad, 2002] Haidar Moukdad. Language-based retrieval of Web documents: An analysis of the Arabic-recognition capabilities of two major search engines. In: Proceedings of the 65th Annual Meeting of the American Society for Information Science and Technology, 18-21 November 2002, Philadelphia, PA. Medford: Information Today, Poster p. 551.
[Moukdad, 2004] Haidar Moukdad. Lost in Cyberspace: How do search engines handle Arabic queries? In: Access to Information: Technologies, Skills, and Socio-Political Context. Proceedings of the 32nd Annual Conference of the Canadian Association for Information Science, Winnipeg, 3-5 June 2004.
[Moukdad and Large, 2001] Haidar Moukdad and Andrew Large. Information retrieval from full-text Arabic databases: Can search engines designed for English do the job? Libri, 51(2):63-74.
[Schamber, 1994] Linda Schamber. Relevance and information behavior. Annual Review of Information Science and Technology, 29:3-48.
[Tague-Sutcliffe, 1992] Jean Tague-Sutcliffe. The Pragmatics of Information Retrieval Experimentation, Revisited. In: Information Processing & Management, 28(4):467-490, Elsevier.
[Womser-Hacker, 1989] Christa Womser-Hacker. Der PADOK-Retrievaltest. Zur Methode und Verwendung statistischer Verfahren bei der Bewertung von Information-Retrieval-Systemen. Hildesheim, Georg Olms.

Ein Polyrepräsentatives Anfrageverfahren für das Multimedia Retrieval
David Zellhöfer und Ingo Schmitt
Brandenburgische Technische Universität Cottbus
Lehrstuhl für Datenbank- und Informationssysteme
david.zellhoefer@tu-cottbus.de

Abstract
Multimedia documents are characterized by a multitude of aspects that can become relevant during a search. Retrieval models rarely consider these aspects holistically, which is why this paper proposes a query model that implements the principle of polyrepresentation. Conceptually, this query model is based on results from quantum logic. The model provides tools to compute the relevance of a document with respect to a cognitive overlap of all document representations. Using the presented approach, it becomes possible for the first time to model the cognitive overlap on the basis of the laws of Boolean algebra. In addition, a relevance feedback method for further personalization of the result documents is presented. This method takes qualitative preference judgments on the result documents as input, freeing users from having to deal with the underlying query model or with the weighting of the individual representations. Finally, an experiment demonstrates the benefit of the proposed query model.

1 Introduction
With the increasing availability of multimedia content such as videos, images, music or rich-media applications, it becomes necessary to offer suitable search mechanisms. Multimedia documents are characterized by various aspects that can become relevant for a particular search need or search context. Search is further complicated by the fact that users' preferences for individual aspects can vary strongly from scenario to scenario. While at one moment the color mood of an image may seem particularly important, at the next moment it may be solely the creator of the document. This example makes clear that the relevance assessment of a document is rarely static.
The multitude of aspects of multimedia documents mentioned above is also reflected in their internal representation within the retrieval system. Documents are usually split into atomic media types such as sounds or images. The resulting fragments can then be described by means of low-level features (LLF) such as color or texture, high-level features (HLF) such as annotations, or metadata (MD) such as the creator or the production date.
Current multimedia retrieval approaches are mostly based on low-level features because, in contrast to the semantically more expressive high-level features, they can be extracted automatically. At the moment, however, these approaches hit a "glass ceiling" that cannot be overcome by adding further LLF [Aucouturier and Pachet, 2004]. To overcome this barrier, tags or annotations are usually evaluated by the retrieval system in addition. Other available metadata, such as Exif (Exchangeable Image File Format; http://www.exif.org), are often neglected, since they lend themselves primarily to being stored in relational databases (DB) and queried there in a structured way.
This paper addresses the problem discussed above from a cognitive point of view. Relevance judgments with respect to a search need are not made by an arbitrary combination of LLF or HLF. Instead, relevance is modeled as a cognitive overlap [Larsen et al., 2006] of all available representations of a document, regardless of whether these representations originate from information retrieval (IR) or from the DB field. To motivate this approach, the principle of polyrepresentation is outlined. The following chapter discusses a polyrepresentative query model based on findings from quantum logic. This model is used to represent multimedia documents and the search for them holistically. Furthermore, a method for personalizing the results is proposed. The presented model can above all be seen as an answer to the open question of a mechanism for forming a cognitive overlap by means of structured queries [Larsen et al., 2006]. Chapter 4 presents first experiments based on the theoretical model. The paper closes with a summary of the results and an outlook on open research questions.

2 The Principle of Polyrepresentation in Information Retrieval
The principle of polyrepresentation was proposed as a cognitive model in IR [Ingwersen and Järvelin, 2005]. It is based on the assumption that different representations of a document can be exploited to form a cognitive overlap [Larsen et al., 2006]. This overlap can be represented as a crisp intersection between functionally and cognitively different representations of a document, as Figure 1 shows. Functionally different representations are titles, abstracts, full texts, etc. that were created by one actor. In contrast, cognitively different representations depend on the current search need, personal preferences, or interpretations by third parties such as indexing librarians. (It should be noted that already the combination of a controlled vocabulary, different stemming methods and the like is polyrepresentative.)
Based on this cognitive overlap, the probability of relevance (POR) of a document with respect to a query, or more generally an information need, can be computed. Due to the combination of different functional and cognitive representations, it is assumed that the intrinsic uncertainty of relevance judgments in IR can be reduced, which ultimately leads to an increase in retrieval quality [Ingwersen and Järvelin, 2005]. This hypothesis is also supported by further work [Skov et al., 2004].
Two essential findings of the latter work concern the degree of structure of the query used and the set of representations that form the cognitive overlap. It can be stated that a large number of cognitively and functionally different representations increases the precision of the IR system. However, the degree of structure present in the query also has a substantial influence on the precision of the system, since it ultimately determines the construction of the cognitive overlap. Queries can range from highly structured, i.e. relying on Boolean connectives, to unstructured, such as bag-of-words queries. According to [Skov et al., 2004], highly structured queries should be preferred, since they are superior to less structured queries in terms of retrieval quality. This finding is also reflected in the results of [Hull, 1997; Turtle and Croft, 1991]. In summary, a large number of representations of a document in combination with a highly structured query clearly increases retrieval quality. Although the experimental results on the principle of polyrepresentation obtained so far are promising, no framework has been proposed to date that makes it possible to model cognitive overlaps in a highly structured way. The next section therefore discusses such a framework, which additionally allows the overlap to be personalized.

3 A Polyrepresentative Query Model
Multimedia documents are intrinsically polyrepresentative due to the multitude of relevance-influencing aspects of which they consist and with which they are represented in a retrieval system: on the one hand functionally (LLF, HLF and MD), and on the other hand cognitively, e.g. under the assumption of a particular search need together with the context-dependent preferences between the individual aspects. However, the question remains open how all existing representations must be combined to model a cognitive overlap, i.e., how a query can be formulated that reflects a user's subjective understanding of similarity with respect to a search goal.
Existing work in the IR field determines such queries empirically for a given search domain without, however, addressing general search tasks. Larsen et al. [Larsen et al., 2006] therefore underline the need for further research on the formation of cognitive overlaps "depending on domains, media, genre, and presentation styles".
Nevertheless, they do not address the influence of the subjective relevance perception of individual users. Ultimately, this subjectivity leads to different manifestations of the cognitive overlap even though its composition remains similar. This manifests itself in representations having a stronger or weaker influence depending on the user or the application context (see Section 3.2).
A further point of criticism of [Larsen et al., 2006] is the assumption of Boolean connectives for combining the representations, which can lead to an empty result set if no document fulfills the conditions for entering the cognitive overlap. This problem does not exist in our approach, since a similarity to the cognitive overlap is computed: if individual conditions are not fulfilled by a document, the document is merely "penalized". Consequently, a soft boundary results, which is why we speak of a penetrable cognitive overlap (PCO) (see Figure 1).
Frommholz and van Rijsbergen [Frommholz and van Rijsbergen, 2009] also recognize the problem of a missing framework that supports the principle of polyrepresentation. Like the one presented here, their work is based on results from quantum mechanics and quantum logic [van Rijsbergen, 2004], but it does not discuss the use of structured queries to model the cognitive overlap.
To construct structured cognitive overlaps, we therefore propose the following query model. The model consists of a query language, which is briefly introduced in the next section, and an adaptive relevance feedback (RF) process [Rocchio, 1971] for user interaction and personalization (see Section 3.2).

3.1 Using the Commuting Quantum Query Language in the Model
The presented approach is based on the commuting quantum query language (CQQL) [Schmitt, 2008], which builds on quantum logic [Birkhoff and Neumann, 1936]. As a query language, CQQL offers the integration of similarity-based IR predicates and relational DB predicates. It can therefore use all access paradigms that exist for processing and searching multimedia documents. In addition, CQQL obeys the laws of Boolean algebra and is therefore a highly structured query language that combines all the advantages described in [Skov et al., 2004; Hull, 1997; Turtle and Croft, 1991]. It thus supports the requirement that "structured Boolean-like query configurations will best support polyrepresentation in IR" [Larsen et al., 2006]. It should be noted that the use of a Boolean algebra must not be equated with the disadvantages of the Boolean retrieval model, such as unordered result lists or the absence of term weighting. CQQL uses similarity or probability values to determine the POR of a document. Examples of the use of CQQL for constructing cognitive overlaps follow below.
To outline the theoretical foundations, the conceptual correspondences between quantum logic and mechanics and the principle of polyrepresentation are sketched in the following.
Figure 1: Venn diagram of different document representations (index: classification, ontologies; high-level: tags, annotations; low-level: color histogram; relational metadata: EXIF; search task) forming a penetrable cognitive overlap.
In quantum mechanics, every state of a microscopic object can be described by a normalized state vector $|\varphi\rangle$. (Following the usual convention in quantum mechanics, we use the Dirac notation for vectors: a column vector is written as $|x\rangle$, while $\langle y|$ denotes a row vector; $\langle x|y\rangle$ consequently expresses the scalar product of the two vectors.) The states are described in a Hilbert space $H$, put simply, a real-valued vector space with a scalar product. In the presented approach we interpret a state vector as one possible representation of a document, e.g. a concrete LLF. To measure the state of a system, we use projectors. A projector $p = \sum_i |i\rangle\langle i|$ is a symmetric and idempotent linear operator defined over a set of orthonormal vectors $|i\rangle$. Multiplying a projector with a state vector corresponds to projecting the vector into the corresponding vector subspace. In other words, it computes the probability with which a state vector is relevant for one aspect of the cognitive overlap. The similarity of the projector $p$ and a state vector $|\varphi\rangle$ is defined as follows:
$\langle\varphi|p|\varphi\rangle = \langle\varphi|\left(\sum_i |i\rangle\langle i|\right)|\varphi\rangle = \sum_i \langle\varphi|i\rangle\langle i|\varphi\rangle$
If this quantum measurement is interpreted as a probability value, the value equals the squared length of the state vector after its projection into the vector subspace spanned by the vectors $|i\rangle$. Moreover, due to the normalization, the value corresponds to the squared cosine of the minimal angle between $|\varphi\rangle$ and the vector subspace represented by $p$. If several quantum systems are to be measured, i.e. several representations are used, the tensor product is used to combine them; the further procedure is the same as described above. For details, in particular on combining individual quantum systems via the tensor product and the resulting theoretical properties and proofs, we refer to [Schmitt, 2008] for reasons of space. Table 1 summarizes the findings so far.

Table 1: Correspondence of concepts between CQQL and the principle of polyrepresentation
CQQL                  Polyrepresentation
State vector          Representation
Projector             Cognitive overlap
Quantum measurement   Probability of relevance w.r.t. the cognitive overlap

Because of the focus of this paper, we refer to [Schmitt, 2008] for the necessary normalization steps and algorithms that have to be applied before a CQQL query is evaluated. Instead, the resulting algebraic evaluation of a CQQL query is presented directly, since it is needed later for measuring the POR of a document.
Let $f_\varphi(d)$ be the evaluation of a document $d$ with respect to a PCO $\varphi$, and let $\varphi_1 \wedge \varphi_2$, $\varphi_1 \vee \varphi_2$ and $\neg\varphi$ denote representations (predicates) combined by logical connectives that model the PCO. If $\varphi$ is an atomic predicate, $f_\varphi(d)$ can be evaluated directly to a numerical value in the interval $[0,1]$, which represents the POR of $d$ with respect to the PCO.
Otherwise, the CQQL predicates are evaluated recursively according to the following formulas:
$f_{\varphi_1 \wedge \varphi_2}(d) = f_{\varphi_1}(d) \cdot f_{\varphi_2}(d)$
$f_{\varphi_1 \vee \varphi_2}(d) = f_{\varphi_1}(d) + f_{\varphi_2}(d) - f_{\varphi_1}(d) \cdot f_{\varphi_2}(d)$ if $\varphi_1$ and $\varphi_2$ are not exclusive
$f_{\varphi_1 \vee \varphi_2}(d) = f_{\varphi_1}(d) + f_{\varphi_2}(d)$ if $\varphi_1$ and $\varphi_2$ are exclusive
$f_{\neg\varphi}(d) = 1 - f_{\varphi}(d)$
A disjunction is called exclusive if it has the form $(\varphi_i \wedge \ldots) \vee (\neg\varphi_i \wedge \ldots)$ for some $\varphi_i$. This makes clear that every evaluation is carried out by simple arithmetic operations that obey the laws of Boolean algebra. Due to the disjunctive normal form that is established before the evaluation [Schmitt, 2008], every evaluation can be expressed as a sum of products of atomic predicates over individual representations. Consequently, the PCO of arbitrary representations is modeled in a highly structured way and then evaluated by simple arithmetic expressions. The result of this evaluation is the POR of a document in the interval $[0,1]$.
Comparing the evaluation rules of CQQL with Kolmogorov's axioms makes the relationship between quantum measurements and probability theory apparent:
$P(X \cap Y) = P(X) \cdot P(Y)$,
$P(X \cup Y) = P(X) + P(Y)$ (for mutually exclusive events), and
$P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)$
where $P$ is the probability of an event $X$ or $Y$. Negation is defined analogously.
A further essential characteristic of CQQL is the embedding of weights into the language, which facilitates the personalization of the PCO (see Section 3.2). In CQQL the weights $\theta_i$ serve to express different degrees of importance of individual predicates. Notably, these weights do not violate the laws of Boolean algebra, as is the case, for example, in [Fagin and Wimmers, 2000], where weighting is handled outside the query language.
Weights in a weighted CQQL query $q_\theta$ are attached directly to connectives, as the following example shows: $q_\theta = (\varphi_1 \wedge_{\theta_1,\theta_2} \varphi_2) \wedge_{\theta_3,\theta_4} \varphi_3$. For the evaluation, the weights are replaced by constants as follows:
$\varphi_1 \wedge_{\theta_1,\theta_2} \varphi_2 \leadsto (\varphi_1 \vee \neg\theta_1) \wedge (\varphi_2 \vee \neg\theta_2)$
$\varphi_1 \vee_{\theta_1,\theta_2} \varphi_2 \leadsto (\varphi_1 \wedge \theta_1) \vee (\varphi_2 \wedge \theta_2)$
Unweighted queries are obtained by setting all weights to 1. The following example query illustrates the use of weights.
$q_\theta$: Find all documents that resemble a given image Img and were created in the year 2009.
$q_\theta := (year = 2009) \wedge_{\theta_1,\theta_2} (edges \approx Img.edges \wedge_{\theta_2,\theta_3} colorLayout \approx Img.colorLayout \wedge_{\theta_4,\theta_5} (blue \approx Img.blue \vee_{\theta_6,\theta_7} orange \approx Img.orange))$
In this example, the year is assumed to be a Boolean predicate, read for instance from relational metadata, while the other conditions are similarity measures based on LLF. The weights $\theta_i$ control the influence of the individual conditions on the evaluation of $q_\theta$.

3.2 User Interaction for Personalizing the Penetrable Cognitive Overlap
A problem that is frequently neglected when modeling cognitive overlaps is the subjectivity of user preferences between the different representations. While one user may react very sensitively to the textures present in an image, another may place particular value on a certain color tone and on available annotations.
Although the PCO could well be modeled by the same structured query for both search needs, it is obvious that the users will not agree on the importance of the individual representations that make up the PCO. Let us therefore consider an example:

A multimedia document collection of a museum is to be searched for documents of the Renaissance. The collection consists of digitized paintings, sculptures, photographs, and books. The media type of each document is stored in a relational database, along with attributes such as creator, year of creation, historical location, etc. In addition, LLF and HLF are available in a retrieval system.

In this scenario, the PCO is formulated in a structured way from several functionally and cognitively different representations of a document, for example the century of creation, the media type, or media-type-dependent LLF as well as annotations. A corresponding CQQL query may look as follows (see footnote 4):

q := ¬(mediaType = "photography") ∧ (century ≈ 15) ∧ ((mediaType = "painting" → dominantColor ≈ RGB(...) ∧ ...) ∨ (mediaType = "sculpture" → ...) ∨ (mediaType = "book" → ...)) ∧ (subject ≈ "madonna" ∨ subject ≈ "pope" ∨ ...)

Because of the structuredness of CQQL it becomes possible to state implications (→), for example to specify particular sets of predicates that are only evaluated in case of a certain media type. This possibility directly addresses the "multi-domain problem" [Larsen et al., 2006] and makes it possible to integrate different media types within one logical query. In the given example, the LLF "Dominant Color" serves as a functional representation of an image, while different representations are used for the other media types. Photographs are excluded from the outset, since they did not exist in the Renaissance.
Although the general construction of a Cognitive Overlap for the Renaissance search scenario is easy to understand, it can be assumed that users have different, subjective preferences between the functional and cognitive representations. One user will put the emphasis on the century, while another pays particular attention to certain color tones (see footnote 5) or to the depicted subjects. To personalize the PCO accordingly, weighting the individual representations suggests itself.
Weighting has a long tradition in IR, dating back to the early extended Boolean retrieval models. Its positive effect on user satisfaction and personalization is considered well established [Hull, 1997; Salton et al., 1983; Lee et al., 1993]. Nevertheless, these weights have to be determined on an empirical basis or set explicitly by the user. Especially the latter is a complicated task that also requires knowledge of the underlying retrieval model.
As an example of the integration of weighted representations into the Cognitive Overlap, [Skov et al., 2004] can be named. In their approach, weights are used during the aggregation of the different similarity values. Conceptually, this approach resembles Fagin and Wimmers' procedure [Fagin and Wimmers, 2000], which likewise separates query logic and weighting.
In contrast, CQQL integrates the weighting into the logic, which guarantees compatibility with Boolean algebra and prevents the weighting from changing the semantics of the query. Transferred to the last example, this leads to the following modification of the query:

qθ := ¬(mediaType = "photography") ∧θ1,θ2 (century ≈ 15) ∧θ3,θ4 ((mediaType = "painting" → dominantColor ≈ RGB(...)...

In the presented method, the weights are evaluated during the arithmetic evaluation, after a logical transformation step (see Section 3.1) [Zellhöfer and Schmitt, 2009]. This integration allows the weighting to be adapted flexibly to the subjective perception of the user without violating the semantics of the query. Such a flexible procedure is not readily possible with weights that are fixed beforehand, as, e.g., during the aforementioned aggregation of different representations.

Footnote 4: Special case of the evaluation: If a query such as q := x ≈ A ∨ x ≈ B is given, then q ≡ ∃t : (t = A ∨ t = B) ∧ x ≈ t, i.e., only the application of transformation rules of first-order logic makes it possible to translate the query into a form that can be evaluated directly by CQQL.
Footnote 5: It has to be noted here that the oil paints of the Renaissance age in a characteristic way, so that experts can estimate the cultural origin of a painting.

As already mentioned, setting the weights is a complex task for the user. For such an input, the user needs knowledge of the query language and clarity about his or her subjective preferences between all representations, which increases the cognitive load. To ensure ease of use, the user is supported in the presented query model by an RF process that relies on machine learning.

User Interaction
The core idea of the user interaction is based on using qualitative preference judgments as the input method for the user. After an initial query against the document collection, potentially relevant documents are displayed. If the user is dissatisfied with this result set, the query – and thus the shape of the PCO – can be adapted iteratively. For this purpose, inductive preferences [Zellhöfer, 2010b] on the result documents are used. Inductive preferences allow the user to state a preference between two or more documents, such as document d1 is more relevant than document d2. Figure 2 illustrates this with the example document #4 is better than #3. What is special about this procedure is that such a judgment can be made without knowledge of the underlying representations or the query language. Comparable judgments are well known from everyday life, where one frequently chooses between two objects without knowing all of their properties in detail. Consequently, the presented method does not increase the cognitive load of the user, which is not the case for all RF methods [White, 2006].
After each user interaction, the preferences must form a directed acyclic graph (a Hasse diagram), which serves as input for the machine learning algorithm that adapts the weights of the underlying CQQL query to the user's requirements.
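The acyclicity requirement just stated can be verified with a standard topological sort, which is also how the text below describes conflict detection. The following is a generic sketch of such a check, assuming preferences are given as directed edges between document indices; it is not the authors' implementation.

import java.util.*;

public class PreferenceGraphCheck {
    /** Returns true if the preference edges (from -> to meaning "from is preferred over to") are acyclic. */
    static boolean isAcyclic(int numDocs, List<int[]> edges) {
        List<List<Integer>> adj = new ArrayList<>();
        int[] inDegree = new int[numDocs];
        for (int i = 0; i < numDocs; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); inDegree[e[1]]++; }
        Deque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < numDocs; i++) if (inDegree[i] == 0) queue.add(i);
        int visited = 0;
        while (!queue.isEmpty()) {                 // Kahn's algorithm: repeatedly remove source nodes
            int node = queue.poll();
            visited++;
            for (int next : adj.get(node)) if (--inDegree[next] == 0) queue.add(next);
        }
        return visited == numDocs;                 // all nodes removed <=> no cycle
    }

    public static void main(String[] args) {
        List<int[]> prefs = new ArrayList<>(List.of(new int[]{0, 1}, new int[]{1, 2}));
        System.out.println(isAcyclic(3, prefs));   // true: 0 > 1 > 2
        prefs.add(new int[]{2, 0});                // adding 2 > 0 closes a cycle
        System.out.println(isAcyclic(3, prefs));   // false: the conflict must be resolved
    }
}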
The learning algorithm is an adapted downhill simplex algorithm [Nelder and Mead, 1965], which is suited to solving nonlinear optimization problems. Because of the arithmetic evaluation of CQQL (see Section 3.1), the learning task falls into this class. The stated preferences serve as constraints. A detailed description of the underlying algorithms can be found in [Zellhöfer and Schmitt, 2009; Schmitt and Zellhöfer, 2009] and is omitted here because of the different thematic focus.
Over the course of the procedure, the method thus accumulates a set of preferences that characterize the user's search need. In this respect it resembles the ostensive model [Campbell, 2000], although preferences do not "age" over time through a decay factor but are complemented or replaced by further ones.
Unsatisfiable preferences are reported by the algorithm. An example of an unsatisfiable preference is a cycle such as: document A > B > C but C > A. Such conflicts can easily be detected by a topological sort whenever the preference graph is modified. Depending on the kind of conflict, an automatic resolution by prioritization or a Pareto composition can be applied [Zellhöfer, 2010a]. The iterative checking of the graph also allows a manual correction by the user; here the user is supported by the system in that it presents the affected preferences and thus enables a clear case-by-case decision.
Ultimately, the described learning steps lead to an adaptation of the PCO to the individual conceptions of the user, while the PCO continues to be based on a highly structured query. To increase retrieval quality, it is conceivable that queries are pre-formulated by domain experts or determined empirically. The weights within the queries then steer, so to speak, the "trend" within the query results in order to personalize them further.

4 Experiments and Usage Scenarios
Despite the theoretical focus of this work, a first experiment was conducted to assess the benefit of the polyrepresentative query model. In addition, the method was employed in a real-world application.
The experiment is based on an image retrieval scenario that uses 575 holiday photographs from Indonesia. The PCO, i.e., the CQQL query, was modeled as a weighted conjunction of several LLF extracted with the LIRe library [Lux and Chatzichristofis, 2008]. To provide a large number of representations of a document, the following LLF were selected: SCALABLECOLOR, COLORLAYOUT, EDGEHISTOGRAM, AUTOCOLORCORRELOGRAM, COLORHISTOGRAM, GABOR, TAMURA, CEDD, and FCTH. The first features stem from the MPEG-7 descriptor set; the last two are aggregated features and are referenced in [Lux and Chatzichristofis, 2008]. In the current experiment, no representations based on HLF or MD were used yet, since primarily the effectiveness of the inductive preferences and of the subsequent learning algorithm was to be examined.
Figure 5 (A) shows the ranked result list based on the aforementioned PCO and an initial, neutral weighting (see footnote 6).
The first iteration uses an explicitly stated preference as input for the learning algorithm: document #5 ("temple") is preferred over document #4 ("dog"). Figure 5 (B) shows the subsequently ranked result documents. It becomes apparent that more temples already appear among the top results. This is also reflected by the rank error. The rank error (Table 2 [Method 2]) counts changes of the rank positions of the documents between two rankings, which makes it possible to illustrate the influence of the preferences. Figure 4 (bottom) shows the rank error between the initial ranking and that of the first iteration. The black diagonal shows the rank position of a document after the first iteration, while the red crosses indicate the original position of that document. That is, the document that is ranked at position #3 after the preference modification (3 on the x-axis) was previously at rank #8 (y-axis). Figure 5 (C) shows the new ranking after two further preferences have been added.
Alternatively, a different method of preference elicitation can be used (Table 2 [Method 1]). In this method, not only explicitly stated preferences are used, as described above, but the ranking of the result documents is exploited as well. Here, preferences are derived from the ranking up to the point at which the user makes a modification; for the concrete example this means that, in addition to #5 > #6, the preferences #1 > #2 and #2 > #3 are added as well. This method also affects the newly generated ranking, as the different rank error in Table 2 [Method 1] shows. Since merely counting position changes is not very informative, Table 2 provides further figures.

Footnote 6: It is not necessary to start with a neutral weighting. Examples of other initial weightings, such as the use of user profiles, can be found in [Zellhöfer and Schmitt, 2008].

Figure 2: Eliciting a preference between document 3 and document 4; document 4 is preferred.

Table 2: Key figures of the two preference elicitation methods
                        Method 1      Method 2
  Rank error (top-30)   26            23
  Rank distance         0.08888889    0.11555555
  Weight distance       5.1999        6.639
  Spearman's ρ          0.909         0.843

The rank distance δrank(rx, ry) (Equation 1; cf. [Zellhöfer and Schmitt, 2009]) expresses the average position difference of the documents in two rankings and can therefore state, more informatively than the rank error, by how many positions documents have been shifted in the ranking:

δrank(rx, ry) = (∑_{i=1}^{n} |pos_rx(di) − pos_ry(di)|) / n²    (1)

where rx and ry are two different rankings, pos_rx(di) is the position of document di in rx, and pos_ry(di) is its position in ry. Both rankings consist of n positions each.
The Euclidean weight distance primarily measures the effect of the stated preferences on the learned weight values:

δweights(w1, w2) = √( ∑_i |w2(i) − w1(i)|² )    (2)

where w1 and w2 are the weight schemes learned on the basis of different preferences.
In addition, the rank correlation coefficient Spearman's ρ was computed, which lies in the interval [−1;1]. The value 1 stands for identical rankings and −1 for maximally different, i.e., reversed, rankings. In the presented experiment, Spearman's ρ primarily measures the strength of the change of the rankings after an iteration.
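A minimal sketch of the three figures reported in Table 2 follows: the rank distance (Equation 1), the Euclidean weight distance (Equation 2), and Spearman's ρ (using the standard formula for rankings without ties). The sample rankings and weights are made up for the example and do not correspond to the paper's data.

public class RankMetrics {
    static double rankDistance(int[] rx, int[] ry) {          // Equation (1); rx[i] is the position of document i
        int n = rx.length;
        double sum = 0;
        for (int i = 0; i < n; i++) sum += Math.abs(rx[i] - ry[i]);
        return sum / ((double) n * n);
    }

    static double weightDistance(double[] w1, double[] w2) {  // Equation (2)
        double sum = 0;
        for (int i = 0; i < w1.length; i++) sum += (w2[i] - w1[i]) * (w2[i] - w1[i]);
        return Math.sqrt(sum);
    }

    static double spearmanRho(int[] rx, int[] ry) {           // 1 - 6*sum(d^2) / (n*(n^2-1)), no ties
        int n = rx.length;
        double d2 = 0;
        for (int i = 0; i < n; i++) d2 += (rx[i] - ry[i]) * (rx[i] - ry[i]);
        return 1.0 - 6.0 * d2 / (n * ((double) n * n - 1));
    }

    public static void main(String[] args) {
        int[] before = {1, 2, 3, 4, 5};
        int[] after  = {2, 1, 3, 5, 4};
        System.out.println(rankDistance(before, after));      // 4 / 25 = 0.16
        System.out.println(spearmanRho(before, after));       // 0.8
        System.out.println(weightDistance(new double[]{1, 1}, new double[]{0.4, 1.3}));
    }
}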
Comparing the two methods, it becomes apparent that using only explicitly stated preferences (Method 2) leads to a more pronounced change of a ranking after a learning step. Whether this difference affects how well the system meets the user's expectations has to be examined in user studies. In summary, the conducted experiment indicates the effectiveness of the proposed polyrepresentative query model and motivates further, more in-depth studies.
Complementing the conducted experiment, the presented method was employed in the music retrieval project "GlobalMusic2one" (http://www.globalmusic2one.net/). The goal of this project is the development of a retrieval system for world music that combines LLF, tags, folksonomies, and rule-based classifications of music genres. The classification rules were provided by musicologists of the HU Berlin. Figure 3 shows such a classification rule.
In a test, a PCO was modeled on the basis of these rules, which partly rely on LLF and relational metadata (song length, artist, and the like), in order to support the similarity search for songs and their classification. Subjective assessments by domain experts indicate that using the proposed query model is superior to using LLF alone. It is assumed that this is due to the structural expressiveness of CQQL, which makes it possible to correctly reflect polyrepresentative concepts that are inherent to music.
Although the first investigations already yield encouraging results, further experiments are necessary to obtain reliable findings. Evidently, the structured combination of representations from different retrieval domains is possible, with the inclusion of weights serving above all the personalization of the result set, which ultimately increases user satisfaction.

Figure 3: Excerpt from the CQQL condition tree for the genre "gypsy brass music"; the weights for the conjunctions are annotated at the edges. (The conjoined conditions comprise origin in the Balkans, Balkan language, song length < 10 min, speed = fast, and instruments such as trumpet, trombone, tuba, and saxophone, with weights between 0.5 and 0.9.)

5 Conclusion and Outlook
This paper presents a cognitively motivated query model for multimedia retrieval. The proposed model uses CQQL to realize the polyrepresentation principle [Ingwersen and Järvelin, 2005] by means of a highly structured query language. At the same time, the user is supported by an RF method that employs machine learning techniques when personalizing the resulting Cognitive Overlap of all available representations of a document. The overall method constitutes a theoretically founded framework for polyrepresentation that can easily be extended to other retrieval areas.
The aforementioned machine learning method is used to personalize the Cognitive Overlap in order to increase user satisfaction. To keep the burden on the user low, the RF relies on inductive preferences, with which "better-than" relations between result documents can be stated. The underlying query model remains hidden from the user.
In this way, an intuitive handling can be ensured.
In the following, some open theoretical and practical questions are presented. Given the realization of the polyrepresentation principle, it appears sensible to position the discussed framework within the "polyrepresentation continuum" [Larsen et al., 2006]. Besides this theoretical classification, the development of a graphical interface for the query model is a central next step, in order to conduct user studies; it is planned to reflect the polyrepresentation principle in the GUI as well. This idea is supported by [White, 2006], who successfully employs the principle for document visualization and navigation. We regard the visualization as a particular challenge, since it has to convey the polyrepresentative characteristics of a document; the simple display of thumbnails or video summaries is no longer sufficient for this purpose.
On the other hand, further experiments are needed to assess the retrieval quality of the discussed approach. Because of the strong influence of subjectivity introduced by the inductive preferences, traditional measures such as precision and recall are only of limited use here. Rather, the use of the Normalized Discounted Cumulative Gain (NDCG) [Järvelin and Kekäläinen, 2002] is sensible, since it admits graded relevance judgments, which are to be expected with the query model at hand. The already mentioned photo collection with 575 documents, which is currently being assessed with graded relevance judgments by several users, serves as the starting point. Following these investigations, first usability tests of the GUI are conceivable.
In the medium term, the modification of the PCO has to be investigated. This becomes necessary because it has to be expected that the search need of the users may shift and thus no longer be covered by the initial PCO. For this, the discussed learning method can be modified so that concrete CQQL queries can be learned, derived from the stated preferences.

References
[Aucouturier and Pachet, 2004] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How high is the sky? In Journal of Negative Results in Speech and Audio Sciences, volume 1 of 1. 2004.
[Birkhoff and Neumann, 1936] G. Birkhoff and J. Neumann. The Logic of Quantum Mechanics. Annals of Mathematics, 37:823–843, 1936.
[Campbell, 2000] Iain Campbell. Interactive Evaluation of the Ostensive Model: Using a New Test Collection of Images with Multiple Relevance Assessments. Inf. Retr., 2(1):89–114, 2000.
[Fagin and Wimmers, 2000] R. Fagin and L. E. Wimmers. A Formula for Incorporating Weights into Scoring Rules. Special Issue of Theoretical Computer Science, (239):309–338, 2000.
[Frommholz and van Rijsbergen, 2009] I. Frommholz and C.J. van Rijsbergen. Towards a Geometrical Model for Polyrepresentation of Information Objects. In Proc. of the "Information Retrieval 2009" Workshop at LWA 2009, 2009.
[Hull, 1997] A. David Hull. Using Structured Queries for Disambiguation in Cross-Language Information Retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval Electronic Working Notes, pages 24–26, 1997.
[Ingwersen and Järvelin, 2005] Peter Ingwersen and Kalervo Järvelin. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Dordrecht, 2005.
[Järvelin and Kekäläinen, 2002] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
[Larsen et al., 2006] Birger Larsen, Peter Ingwersen, and Jaana Kekäläinen. The polyrepresentation continuum in IR. In IIiX: Proceedings of the 1st international conference on Information interaction in context, pages 88–96. ACM, 2006.
[Lee et al., 1993] H. J. Lee, Y. W. Kim, H. M. Kim, and J. Y. Lee. On the Evaluation of Boolean Operators in the Extended Boolean Retrieval Framework. In Robert Korfhage, Edie M. Rasmussen, and Peter Willett, editors, ACM/SIGIR 1993, Proceedings of 16th Annual International Conference on Research and Development in Information Retrieval, Pittsburgh, USA, pages 291–297, 1993.
[Lux and Chatzichristofis, 2008] Mathias Lux and A. Savvas Chatzichristofis. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In MM '08: Proceedings of the 16th ACM International Conference on Multimedia, pages 1085–1088. ACM, 2008.
[Nelder and Mead, 1965] A. J. Nelder and R. Mead. A Simplex Method for Function Minimization. Computer Journal, 7:308–313, 1965.
[Rocchio, 1971] J. Rocchio. Relevance Feedback in Information Retrieval. In The SMART Retrieval System, pages 313–323. 1971.
[Salton et al., 1983] Gerard Salton, A. Edward Fox, and Harry Wu. Extended Boolean Information Retrieval. Commun. ACM, 26(11):1022–1036, 1983.
[Schmitt and Zellhöfer, 2009] Ingo Schmitt and David Zellhöfer. Lernen nutzerspezifischer Gewichte innerhalb einer logikbasierten Anfragesprache. In Christoph Johann Freytag, Thomas Ruf, Wolfgang Lehner, and Gottfried Vossen, editors, Datenbanksysteme in Business, Technologie und Web (BTW 2009), 13. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), Proceedings, 2.-6. März 2009, Münster, Germany, volume 144 of LNI, pages 137–156. GI, 2009.
[Schmitt, 2008] Ingo Schmitt. QQL: A DB&IR Query Language. The VLDB Journal, 17(1):39–56, 2008.
[Skov et al., 2004] Mette Skov, Henriette Pedersen, Birger Larsen, and Peter Ingwersen. Testing the Principle of Polyrepresentation. In Peter Ingwersen, C.J. van Rijsbergen, and Nick Belkin, editors, Proceedings of ACM SIGIR 2004 Workshop on "Information Retrieval in Context", pages 47–49, 2004.
[Turtle and Croft, 1991] Howard Turtle and Bruce W. Croft. Evaluation of an inference network-based retrieval model. ACM Trans. Inf. Syst., 9(3):187–222, 1991.
[van Rijsbergen, 2004] C.J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, Cambridge, England, 2004.
[White, 2006] W. Ryen White. Using searcher simulations to redesign a polyrepresentative implicit feedback interface. Inf. Process. Manage., 42(5):1185–1202, 2006.
[Zellhöfer and Schmitt, 2008] David Zellhöfer and Ingo Schmitt. A Poset Based Approach for Condition Weighting. In 6th International Workshop on Adaptive Multimedia Retrieval. 2008.
[Zellhöfer and Schmitt, 2009] David Zellhöfer and Ingo Schmitt. A Preference-based Approach for Interactive Weight Learning: Learning Weights within a Logic-Based Query Language. Distributed and Parallel Databases, 2009.
[Zellhöfer, 2010a] David Zellhöfer. Eliciting Inductive User Preferences for Multimedia Information Retrieval. In Wolf-Tilo Balke and Christoph Lofi, editors, Proceedings of the 22nd Workshop "Grundlagen von Datenbanken 2010", volume 581, 2010.
[Zellhöfer, 2010b] David Zellhöfer.
Inductive User Preference Manipulation for Multimedia Retrieval. In Laszlo Böszörmenyi, Dumitru Burdescu, Philip Davies, and David Newell, editors, Proc. of the Second International Conference on Advances in Multimedia, 2010.

Figure 4: Rank error for Method 1 (top) and Method 2 (bottom), based on the preference from the first iteration shown in Figure 5.
Figure 5: Excerpt from the result list after two iterations – documents ranked by the conjunction (A), after the first iteration (B), and after the second iteration (C). (The arrows indicate inverted preferences between two documents.)

Workshop on Knowledge and Experience Management FGWM 2010
Joachim Baumeister, Intelligent Systems (Computer Science VI), University of Würzburg, joba@uni-wuerzburg.de
Thomas Roth-Berghofer, University of Hildesheim & German Research Center for Artificial Intelligence DFKI GmbH, thomas.roth-berghofer@dfki.de

The FGWM Workshop Series
The workshop "Knowledge and Experience Management" is organised annually by the Special Interest Group Knowledge Management (Fachgruppe Wissensmanagement, in short FGWM) of the German Informatics Society (GI). The main goal is to enable and further the exchange of innovative ideas and practical applications in the field of knowledge and experience management. Also, it aims to provide an interdisciplinary forum. Submissions from current research in these and adjacent areas were welcome. Moreover, contributions that describe work in progress or approaches that have not yet been investigated comprehensively were of special interest. The workshop provided the opportunity for young researchers to present and discuss preliminary work, to get feedback from a larger audience with quite different viewpoints, and to get accustomed to presenting and defending their work at scientific events. As every year, the workshop was a forum for testing the viability of new ideas, by junior as well as senior researchers, before more effort and resources are put into them.

FGWM 2010
The proceedings contain the papers presented at FGWM 2010, held on October 4–6, 2010 in Kassel, Germany. Six research papers and one resubmission of previously published work were accepted; for the latter, only a one-page abstract is included in the proceedings. The topics of interest of the FGWM workshop series are: • Experience/knowledge search and knowledge integration approaches (case-based reasoning, logic-based approaches, text-based approaches, semantic portals/wikis/blogs, Web 2.0, etc.) • Applications of knowledge and experience management (corporate memories, e-commerce, design, tutoring/e-learning, e-government, software engineering, robotics, medicine, etc.) • (Semantic) Web Services for knowledge management • Agile approaches within the KM domain • Agent-based & Peer-to-Peer knowledge management • Just-in-time retrieval and knowledge capturing • Ways of knowledge representation (ontologies, similarity, retrieval, adaptive knowledge, etc.) • Support of authoring and maintenance processes (change management, requirements tracing, (distributed) version control, etc.)
• Evaluation of knowledge management systems • Practical experiences (“lessons learned”) with IT aided approaches • Integration of knowledge management and business processes • Introspection and explanation capabilities of knowl- edge management systems • Linked (Open) Data Programme Comittee • Andreas Abecker, FZI Karlsruhe • Klaus-Dieter Althoff, University of Hildesheim • Joachim Baumeister, University of Wu¨rzburg • Ralph Bergmann, University of Trier • Andrea Kohlhase, Jacobs University Bremen • Ronald Maier, University of Innsbruck, Austria • Heiko Maus, DFKI GmbH • Mirjam Minor, University of Trier • Markus Nick, empolis GmH • Ulrich Reimer, University of Applied Sciences St. Gallen, Switzerland • Bodo Rieger, University of Osnabru¨ck • Thomas Roth-Berghofer, University of Hildesheim & DFKI GmbH • Rainer Schmidt, University of Rostock • Steffen Staab, University of Koblenz-Landau Additional Reviewers Daniel Bachlechner, Christopher Harb, and Leo Sauermann We thank the Program Committee and, particularly, the ad- ditional reviewers for their efforts and good work. If you would like to participate in further discussions or receive further information about future workshops you might consider joining the FGWM mailing list 1. August 2010 Joachim Baumeister Thomas Roth-Berghofer Acknowledgements This volume has been produced in part using the EasyChair system2. We would like to express our gratitude to its author Andrei Voronkov. 1http://www.fgwm.de/newsletter.html 2http://www.easychair.org 237 Knowledge System Prototyping for Usability Engineering Martina Freiberg, Johannes Mitlmeier, Joachim Baumeister, Frank Puppe University of Wu¨rzburg D-97074, Wu¨rzburg, Germany freiberg/joba/puppe@informatik.uni-wuerzburg.de Abstract Knowledge-based consultation and documenta- tion systems are widely distributed in industrial and medical environments today. Yet, their im- plementation still is a tedious and costly task. Furthermore, the aspect of usability—which is in principle of critical importance for those systems—is often nearly unconsidered. We ar- gue, that tailored UI prototyping can help to tackle both issues. Therefore, we propose a UI prototyping tool for knowledge-based systems, intended to enhance knowledge systems engi- neering in itself, and to foster usability engineer- ing in that context. Keywords: Knowledge-based System, User Interface Prototyping, Usability, Human Computer Interaction 1 Introduction Knowledge systems (KS)—in the context of our pa- per knowledge-based consultation and documentation systems—are applied in various industrial and medical en- vironments today. Regarding their development, it has to be differentiated between knowledge system engineer- ing (KS Engineering) and knowledge engineering (KE). The former comprises the entire development process of a knowledge-based system, including especially its UI and interaction design; the latter specifically addresses the def- inition and formalization of the required knowledge, e.g., the terminology, or explicit problem-solving knowledge. KS Engineering Pitfalls and SE Solutions Despite the widespread use of knowledge systems, and consequently increasing research efforts regarding their de- velopment in the last decades, KS Engineering still remains a tedious and complex task. Among the main pitfalls are high development costs, both in terms of money and time. 
Thereby, the sub-task KE alone often causes a major part of the expenses, hence influencing other KS Engineering activities: In many cases, UI- and interaction design in general—or a more targeted comparison of several equal design options—would require more attention. For knowledge based systems, it is further of critical importance, that they are intuitive and easy to use as to not distract the users from the often difficult, domain-specific jobs, they are intended to support (e.g., decision-support in medical contexts). Yet, despite various recognized general usability engineering approaches, usability issues often also remain almost unconsidered. Finally, the often still missing true understanding of such systems and their benefits on the side of potential customers, can make it difficult to promote respective projects in the first place due to the overall complexity and costs. In general software engineering (SE) and in human- computer interaction (HCI), user interface (UI) prototyping is an established method for iterative specification and refinement before implementing the productive system [3; 4]. The increased flexibility arising from prototyping- based specification and design offers the chance to adapt system (and especially interface) requirements to changing base requirements or customer wishes in a more efficient, inexpensive manner. The affordabil- ity of the approach also permits the early evaluation and comparison of design alternatives. In providing a (visual) basis for communication, UI prototyping can help to specify requirements more precisely. Thus, the potential risks of fundamental misunderstandings, and a resulting, more expensive redesign of central conceptions at a later stage of the project, can be reduced. Also, the overall system vision can be communicated and re- fined more easily with the help of an appropriate prototype. In tailoring UI prototyping for knowledge systems, we aim at exploiting those advantages. Particularly, we intend to foster affordable, pragmatic KS Engineering, that both helps to promote respective projects in the first place, and alleviates the overall task. Our approach intentionally fo- cusses on interface and interaction design of KS, and on an increased integration of usability considerations in the process. Regarding the specification and formalization of the required terminology and explicit knowledge of a KS, there exist various established KE methods today, each of which can be equally well applied—thus, we do not further discuss or value their suitability here. Related Work In general SE and in HCI there exist numerous classify- ing approaches regarding prototyping and the prototyping process, e.g., see [3; 4; 5; 8; 11]. Apart from manifold general prototyping tools and methodologies available— see Beaudouin-Lafon and Mackay [4] for an overview— also tailored tools have been developed in various specific domains; examples are the field of multimodal interaction research [12], or the field of cross device interface design [10]. To the best of our knowledge, no prototyping ap- proaches or tools exist, that specifically address knowledge systems according to our definition (see Section 3.1). Lim et al. 
[9] note that in HCI, prototypes to date 239 mainly are used for evaluation purposes—such as usabil- ity testing—and that in SE, prototyping mostly constitutes a means for supporting requirements engineering; in ex- tension to that, they suggest prototypes as tools for de- signers to frame, refine, and discover options in a design space. Regarding an integration of software- and usability engineering, Memmel et al. [13] similarly claim an incor- poration of visual requirements engineering—based on ap- propriate prototypes—much earlier in the overall process. Merging those insights, we propose UI prototyping as a rather pragmatic means for KS requirements engineering. As opposed to the above approaches, we explicitly address knowledge systems, that exhibit some specific characteris- tics. First, KS mostly consist of a rather fixed set of UI elements and user-system interactions; most often, for ex- ample, questions are presented, answered by the user, and cause a certain follow-up system (re)action. An adequate prototyping tool thus firstly should support the design of such elementary elements as flexibly as possible. For real- istically emulating actual knowledge systems, additionally the imitation—or actual integration—of the underlying ex- plicit knowledge needs to be supported, as to enable a rea- sonable judgement regarding the applicability and usability of the overall future system. When designing with usability in mind, some kind of iterative process is highly advisable [14]. Angele et al. [1] introduce a cyclic process model for developing knowledge-based systems; it incorporates prototyping techniques, but furthermore also formal specification and KE activities, thus constituting an entire, rather heavy- weight engineering approach. Contrastingly, we suggest an extension of the rather lightweight Agile Process Model [2]; thereby, the focus is on providing an overall pragmatic method of KS Engineering and on enabling a rather inexpensive integration of usability activities. In summary, we contribute to current research by • proposing an overall approach that integrates efficient, affordable KS Engineering and usability engineering. • introducing the UI prototyping and engineering tool ProET, specifically tailored for the design of knowledge-based systems The remainder of the paper is organized as follows: In Sec- tion 2, we discuss a customized, prototyping-based KS En- gineering process, as well as its potential regarding an in- tegration of usability activities. We present ProET, an UI prototyping and engineering tool for knowledge-based sys- tems, in Section 3. In Section 4, we report on experiences of exemplarily recreating existing knowledge systems with the tool, and on its consequential current scope and limi- tations. We conclude with a summary and an outlook to further research directions in Section 5. 2 Pragmatic KS Engineering for Usability Regarding knowledge system development and knowledge engineering, there exist diverse approaches today, such as CommonKADS, MIKE, or adaptions of the classical stage- based and incremental software development models. Yet, for the success of knowledge system projects also and es- pecially regarding small to mid-sized companies, a prag- matic approach—affordable and efficient regarding time and effort—is essential, c.f. [2]. Especially for promoting such projects in the first place, it is important to quickly UA1: Prototype Expert / Hybrid UA2: Prod. 
System, User-based / Hybrid Integration System Metaphor Planning Game Implementation Tailored Patterns & Prototypes Figure 1: Extended Agile Process Model. come up with first solutions, e.g., in the form of proto- types or example implementations. In this respect, we made positive experiences with applying the Agile Process Model, described in [2]. However, that model emphasizes knowledge base development, not yet taking much into ac- count the design of the target system’s interface, or usabil- ity traits. Targeting an overall approach that supports pragmatic, affordable, and usability-involving KS Engineering, we propose the extension of the Agile Process Model by integrating pattern-based design, prototyping, and usability techniques into the original model. Figure 1 introduces the entire resulting Extended Agile Process Model. Although pattern integration and respective activities are included in the following for reasons of completeness, their more detailed discussion is part of further work, see [7]. The gray parts of Figure 1 represent the original model, consisting of the four phases System Metaphor, Planning Game, Implementation, and Integration. For a detailed dis- cussion, see [2]. Basically, tailored patterns and prototyping can support both System Metaphor and Planning Game. In System Metaphor, the system objectives are defined by developers and customers. Based on appropriate patterns and corre- sponding implementation examples, a basic idea can be de- veloped more easily. Thereby, patterns can be assessed ei- ther manually, or by using a tailored recommender system, that suggests patterns matching the target context. Proto- types, that also provide the relevant user-system interac- tions, further support that process by presenting a realistic simulation of a potentially resulting system as opposed to the static, visual depiction of knowledge system examples provided by the patterns. The Planning Game defines the scope and prioritization of development tasks. Here, patterns ease the analysis and valuation of system requirements—taking place during the Exploration sub-phase of the planning game—by providing clear specifications of required features and interactions. Additionally, prototyping supports that task by allowing for actually trying out (and thus better evaluating) relevant functionalities. With regards to Usability Activities, the original model can be extended both regarding Implementation and In- tegration (Figure 1, UA1, UA2). The basic model de- fines Implementation as a test-first activity—i.e., before ac- tually implementing new or additional features, the cor- responding tests for assuring their correctness are devel- oped. This can be expanded by an evaluation-first activ- ity, in the sense that based on the formerly created pro- 240 totypes, usability issues are assessed and valued first, be- fore continuing with test-first implementation as defined by the model. Performing prototype-based usability eval- uation offers the chance to reveal defects of the design at early stages. This can considerably lower development costs, as the adaption/revision of a prototype is rather in- expensive, in contrast to adapting an preliminarily imple- mented, or even already productive system. Without going into detail here, at that stage, expert- or hybrid approaches (according to a categorization suggested in [6]) seem to be most appropriate. 
For example, rather light-weight techniques—as feature-/consistency inspection—but also more comprehensive methods—as heuristic evaluation or expert walkthrough—are performed by the developer (”ex- pert”). In case even future users—e.g., project partners or their employees—are available, hybrid methods such as pluralistic walkthrough or participatory/cooperative heuris- tic evaluation potentially can provide the most benefits. However, some of those techniques require at least a partly functional system—as explained in Section 3, also fundamental interactions of knowledge systems can be designed/simulated with the suggested prototyping tool ProET. Thus, those techniques are (at least partly) appli- cable to the developed UI prototypes. During Integration, the implemented functionality is added to the productive system, using integration tests for assuring its overall correctness and integrity. Such testing can be extended by usability evaluation activities that check, whether the system still meets the specified usability goals. As Integration results in a running version of the productive system, it is not only possible, but rather highly advisable, to evaluate the applicability of the system in the target context with real users. Thus, not only hybrid, but also purely user-based usability evaluation techniques are beneficial—example techniques are querying, user studies, or controlled experiments. Additionally, again also Hybrid Approaches may also provide valuable insights regarding the actual use of the knowledge system and potential, remaining defects. The suggested approach aims at turning overall KS Engineering into a more pragmatic process, equally suitable for promoting KS projects—by quickly setting up and presenting actual KS examples (prototypes) to customers—and for specifying requirements as well as the system design in a more agile manner. Due to the highly iterative process, that also involves usability evaluation activities at specified stages, potential system flaws may be detected, or even prevented more effectively. 3 The Prototyping Tool ProET In this section, we introduce the prototyping and engineer- ing tool ProET, that we are developing to support afford- able and efficient KS Engineering, as well as to foster an eased integration of usability-related activities in the over- all process. Therefore, we first define the specific type of target systems, as well as typical components those sys- tems are built of. Afterwards, we introduce ProET and its workflow of creating prototypes in more detail. 3.1 Target Knowledge Systems and Components By knowledge system, we understand systems that may implement various forms of knowledge—such as rules, or covering models—to support the user as efficiently as possible in performing the task at hand. Thereby, we specifically think of consultation and documentation tasks—in the first case, the system provides decision- support or recommendations regarding a specified problem area (e.g., in medical or fault detection contexts); in the second case, users are assisted in entering a certain set of data and the system ensures its quality (e.g., measured by completeness and correctness). The initial capabilities of the tool are based on our past experiences with developing knowledge-based systems. For the greater part, those were implemented as web-based systems—not necessarily meaning they are made available to large masses of users via internet, but in the general sense of ”running in a browser”. 
Thus, the tool specifically supports web-based consultation and documentation systems engineering. For those target systems, typically a certain set of visible knowledge components can be identified: - Questions: Requesting required input data from the user - Questionnaires: May be used to group the (potentially large set of) questions - Answers: A fixed set of reasonable input data (answer alternatives) to choose from, or free text input facilities - Solutions: Fault or medical diagnoses, or action recom- mendations, that are derived by the included diagnosis knowledge (invisible knowledge components, e.g., rules) - Ancillary Information: Informal knowledge repre- sentations, detailed elaborations of questions/solutions, or add-on information regarding the overall consulta- tion/documentation progress. Those components form the conceptual basis of the wid- gets currently supported by ProET. Thus, they constitute the elementary items that are to be specified in the declara- tive prototype specification file (as described in the follow- ing section in more detail). 3.2 Introducing ProET The prototyping and engineering tool ProET is an UI prototyping tool specifically tailored for web-based consul- tation and documentation systems. Thereby, prototyping is supported gradually: First, exemplary system definitions (and corresponding templates/styles) allow for quickly and easily creating and exploring the basic collection of knowledge systems supported so far. Based on those available specifications, adapted KS prototypes can be created in a copy & modify manner, where the degree of modification can vary arbitrarily. With the extensibility of the tool, finally also entirely different interfaces/UI components can be developed and integrated, if required (see Section ”Extending ProET”). The tool supports two basic modes of prototyping: Com- plete specification of all textual elements, as well as their (partly or entire) auto-generation. The first variant is useful for prototyping and evaluating concrete KS ideas. The required visible knowledge com- ponents (such as questions, answers...) are defined in the declarative prototype specification file. As also elementary interactions—e.g. coloring the next suggested question, or hiding/unfolding parts of the dialog—are available, future knowledge systems can be simulated rather realistically. Thus, it can be examined whether a chosen UI/interaction design is suitable in a given, domain-specific context. The option of auto-generating textual elements, further- 241 more allows for a more design-oriented prototyping: Not having to consider the specification of actually reasonable knowledge base elements simplifies the concentration on UI/interaction design questions. This can be helpful in case several designs are to be compared against each other, or provided that general design issues need to be evaluated, independent from any future system. Technical Basis ProET is a JAVA application that integrates several web-based technologies for engineering UI prototypes of web-based knowledge systems. The created prototype is HTML-based, enriched by JavaScript/AJAX for inter- activity and styled by CSS. The tool is probably most comfortably used from within some kind of IDE—such as Eclipse1—that supports editing of the required file formats, as well as an easy management of the project itself. Prototyping Workflow To provide a first impression, Figure 2 presents a questionary-style, partly auto-generated consultation sys- tem prototype. 
Figure 2 (A) displays a page containing concretely specified questions and answers; another page of the same prototype, with auto-generated textual elements, is shown in Figure 2 (B). Prototyping with ProET currently is purely text-based. We use the above prototype as a running example when introducing the three types of specification files required for prototyping with ProET:
• An XML-based specification file (central prototype and textual content specification)
• String Template files (creating HTML-/JavaScript-based fragments for each defined component)
• CSS files (design definition of the specified components)
Once the specification via these files is finished, the prototype is assembled: the textual contents from the XML specification are filled in, and HTML-/JavaScript-based component representations are created using the String Templates for the structure and CSS for the concrete styling.
XML-based, Elementary Prototype Specification. The elementary prototype specification—i.e., its skeletal structure, consisting of the textual elements as well as of their basic UI properties (e.g., whether the question style is one-choice or multiple-choice)—is provided in an XML-based format. For each of the visible (textual) knowledge system components, matching tags are provided—e.g., a dedicated tag is used for defining answers. Within those tags, the fundamental UI properties are specified as attributes—a multiple-choice question, for example, is defined via the corresponding answer-type attribute. Furthermore, the XML specification references the respective template and CSS files that are additionally required for creating the prototype.
Figure 2 (C) presents an excerpt of the file used for specifying the prototype shown in (A). Excluding the standard XML header, the framing dialog tag is the topmost element; there, the knowledge system type—here: type='gen', referring to a predefined, (partly) auto-generated prototype style—is defined, as well as references to the respective template namespace and style files. Figure 2 (D) exemplifies the specification of a questionnaire (here still named page) that consists of several questions; due to space reasons, only the detailed definition of the first question is printed completely. The example illustrates the specification of the question Do you like surveys?, the setting of its basic UI style by answer-type='oc', as well as its three answer alternatives Yes, No, Neither...Nor. Furthermore, the definition of generated textual elements is exemplified in Figure 2 (E). This is achieved by using dedicated tags that can be attributed by the desired number of questions of the generated page (num-questions='3') and respective answer alternatives (num-answers='3'), by the text lengths (question-/answer-length='...), or by the basic answer style (as above). Regarding the auto-generation, it is possible to either define the number of questions (answers, text/answer length) strictly—e.g. num-questions='3'—or to specify a more flexible range—e.g. num-questions='2-5'—resulting in a randomly calculated minimum of two and maximum of five questions.
(Footnote 1: http://www.eclipse.org/)
String Templates. String Template files provide the HTML counterparts for each of the defined knowledge system components. When creating a prototype, one framing template is defined for the dialog as a whole—Figure 2 (F)—which contains the skeletal HTML framework.
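As a generic illustration of how such $-delimited template attributes are filled—independent of ProET's actual template files, which are not reproduced here—the following sketch assumes the classic StringTemplate 3 API (org.antlr.stringtemplate); the template strings and attribute names are invented for the example.

import org.antlr.stringtemplate.StringTemplate;

public class TemplateSketch {
    public static void main(String[] args) {
        // Sub-template: a hypothetical HTML fragment for a single question component.
        StringTemplate question = new StringTemplate(
            "<div class=\"question\">$text$</div>");
        question.setAttribute("text", "Do you like surveys?");

        // Framing template: skeletal HTML into which the rendered children are inserted.
        StringTemplate dialog = new StringTemplate(
            "<html><body><div class=\"dialog-content\">$children$</div></body></html>");
        dialog.setAttribute("children", question.toString());

        System.out.println(dialog.toString());
    }
}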
From there, sub-templates are referenced that define the HTML fragments of the respective components separately—in the figure, $children$ means that the also depicted template for the children of the dialog-content element—which are pages—is inserted at this point. Splitting the UI templates into framing and sub-templates, according to the corresponding component definitions, provides the advantage that the separate component templates, e.g. for a page, can be reused in several different prototypes.
CSS for UI styling. The actual styling/design of the components is finally specified using standard CSS. Most basically, each knowledge system component can be globally styled by a CSS class with a matching name—e.g., common properties of all questions can be set by a .question class. Yet, a more fine-granular styling is also possible—additional, element-specific CSS classes or IDs can be set within the template files and then be specified separately in the CSS file.
Interactivity. With ProET not only static UI designs can be created, but also the elementary interactivity that would be expected of an actually implemented knowledge system. Examples are the interactive coloring of questions (or solutions) according to their current status—e.g., answered/suggested next (or established/excluded)—or hiding parts of the questions until another defined question was answered by the user. Such interactive behavior is achieved by the use of JavaScript in ProET. The required functionality is defined in separate JS files, whereas the corresponding function calls are inserted in the String Template file where needed—e.g., if a given element should provide some interactivity on mouse-click, an attribute like onClick="javascript:doSomething()" is inserted within the respective tag defining that element in the template file.
Figure 2: Partly auto-generated prototype of a questionary-like consultation system—manually entered questions (A), auto-generated questions (B), the corresponding prototype specification (C), and an exemplary template file (F).
Extending ProET. Apart from just adapting existing system templates/designs, it is also possible to extend the tool by defining entirely new elements. A distinction has to be made between rather simple extensions—such as new, XML-based prototype specifications—and more sophisticated extensions regarding entirely novel components—for example, progress indicators or the like as separate elements. Regarding the definition of a new prototype, simply a new XML specification is created, using the available set of tags and corresponding attributes. New or adapted String Templates for components can be introduced by providing a corresponding type attribute within the dialog tag and a corresponding String Template file; the type value is then matched by a certain mechanism against potential templates until the most appropriate one is found; due to space reasons, the entire naming-and-matching mechanism is not explained in detail here. Finally, to modify not the basic form of the prototype but only its UI design, additional CSS files may be created and simply referenced in the dialog tag of the specification. This permits an easy exchange and comparison of several design options. The extension of the tool by entirely new components, on the other hand, also is possible, yet more complex.
This ad- ditionally requires an adaption of the underlying Java code (e.g., create new container classes for the element, adapt the XML parsers to correctly parse that new elements, and so on). The details of this entire extension process, how- ever, would fall out of the scope of this paper. 4 Experiences with ProET For a preliminary assessment of the capabilities of ProET, we recreated knowledge systems that have been developed by our department in the past. Thereby, the first goal was to match those original systems as closely as possible. Figures 3 and 4 present the outcome of replicating two rather different systems; each figure shows both the origi- nal system in the background (A) and the prototype in the foreground (B). In the following, we first shortly introduce each system and summarize the insights regarding its replication with ProET. Based on that, we discuss the scope and limitations of the tool in the subsequent section. 4.1 Knowledge System Replication The Consultation on Rheumatic Disease, Figure 3, served as the first case study. Based on the entered symptoms, that system consults the user as to whether a rheumatic disease is probable. Basically, a questionary-style is implemented, meaning that the system presents more than one question at a time to the user. Thereby, a certain coloring metaphor is applied for supporting an answering of the questions in the most reasonable sequence: Questions, that are already answered are colored gray; not yet answered questions are colored yellow, and the suggested next question is colored green. This coloring-based interactivity is also mirrored in the prototype. There, JavaScript-based techniques are used to change the coloring according to the user’s actions. Thus, the basic system interaction and styling can be realized by the means of the prototyping tool quite well. In some mi- nor aspects, however, the prototype differs from the orig- inal system. First, a seemingly more loosely assembled interface appearance; whereas in the original system kind of a (visible) table-based layout was used, the prototype builds on a CSS-based layout, defining questions as sepa- rate elements and rendering them without using tables at all. If desired, however, it is with some effort possible, to adapt the template files as to also use (visible) tables for element arrangement. Furthermore, the original system provides a progress bar at the top of the dialog, indicating feedback on the proportion of already processed questions. Such feedback components are currently not yet included in ProET—by using JavaScript techniques, and by defin- ing a corresponding component type and respective tem- plates/styles, such or similar interactive feedback elements could be added in the future. As a second case study, we chose the Labour Legislation Consultation, that in contrast to the rheumatic consultation implements a fundamentally different, hierarchy-based in- teraction style. There, the problem to solve—in this case the question, whether an employment contract was termi- nated legitimately—is displayed as the top element in the hierarchy (see Figure 4) and its current rating (e.g., estab- lished/suggested/excluded) is indicated by its coloring. On the next hierarchical sub-layer, questions are displayed that help to clarify that problem—in the example, the second item from the top ”Compliance with form...”, and all items on the same hierarchical level. 
Where reasonable, those questions are further subdivided—for example, the question "Dismissal was not prohibited..." (third item from the top) is subdivided into eight further questions that describe more detailed aspects of the parent question. Thus, the user can choose whether to answer the more abstract questions or rather the more refined ones. Based on the provided answers, the system calculates ratings regarding the problem statement; the ratings of the sub-questions are accumulated into one rating that is then presented for the parent question—this propagation proceeds up through the complete hierarchy to the main problem statement/solution.

Figure 4 shows that this system type is matched well by the tool. The necessary interactivity—unfolding sub-hierarchies of questions by mouse click, coloring the questions depending on the answer, and accumulating/propagating solution states throughout the hierarchy—is also supported. One minor difference between the original system and the prototype again concerns the interface design: first, the "+" and "-" signs—indicating whether a question can be further subdivided—are not integrated in ProET. Also, the coloring of the questions does not end with the last character of the question as in the original, but spans a defined width.

4.2 ProET: Scope and Limitations
The first replication case studies revealed the need to extend ProET with further components and templates in order to recreate the chosen initial set of knowledge systems completely. Examples are the aforementioned feedback components (e.g., progress bars) or further options/templates regarding the general layout (e.g., table-based or multi-column layouts). Despite such minor shortcomings, ProET currently comprises a reasonable set of widgets, as well as exemplary prototype specification files, that enable the near-complete recreation of many knowledge systems developed by our department in the past. Their replication, as well as adapting the given system specifications and corresponding template/style files for examining alternative designs, was rather unproblematic.

Figure 3: Consultation on rheumatic diseases—an example of a standard questionnaire-style consultation system. Original system (A, in German) and recreated prototype (B).

Figure 4: Labour legislation consultation—an example of a hierarchical-style consultation system. Original system (A, in German) and recreated prototype (B).

Yet, the range of KS types and widgets that can be prototyped without (major) tool extensions is limited by default. This is due to the fact that we explicitly chose web-based consultation and documentation systems for defining the initial set of features supported by the tool. Apart from the static widgets, so far only selected interaction forms are supported. This mainly enables the creation of two basic system styles at the moment: questionnaire-based (Figure 3) and navigable hierarchy-based (Figure 4). These elementary styles can be adapted with regard to various aspects, such as grouping questions into questionnaires, optionally showing side panels for direct navigation of the pages, the presentation of solutions and their derivation states, or different forms of designing header and footer elements.
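Returning briefly to the hierarchical style of Section 4.1: purely as an illustration (the rating scale and the averaging rule are assumptions, not ProET's or the original system's actual logic), the bottom-up accumulation of ratings described there might be sketched in Java as follows:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only: bottom-up accumulation of ratings in a question
 * hierarchy, as described for the labour legislation prototype. The rating
 * scale and the averaging rule are assumptions.
 */
public class QuestionNode {

    private final List<QuestionNode> children = new ArrayList<>();
    private double ownRating;  // e.g., -1.0 = excluded, 0.0 = unclear, 1.0 = established

    public void addChild(QuestionNode child) {
        children.add(child);
    }

    public void setOwnRating(double rating) {
        this.ownRating = rating;
    }

    /** A leaf keeps its own rating; inner nodes aggregate their children's ratings. */
    public double accumulatedRating() {
        if (children.isEmpty()) {
            return ownRating;
        }
        double sum = 0.0;
        for (QuestionNode child : children) {
            sum += child.accumulatedRating();
        }
        return sum / children.size();
    }
}
```

In the prototype, the resulting value per node would then simply be mapped to the coloring of the corresponding element.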
We are aware that there surely exist other, equally relevant knowledge system types and respective designs developed outside our department; a more comprehensive investigation of such external systems, the identification of additional fundamental KS components, and the appropriate extension of ProET are subject of further research.

5 Conclusion
In this paper, we introduced the tool ProET for developing knowledge system prototypes. We introduced a tailored process model for pattern-based, prototyping- and usability-integrating KS engineering, and we discussed potential benefits as well as how the approach can be applied to pragmatically promote and conduct respective projects.

First experiences with ProET revealed the need for its further extension. Apart from systems created by our department, externally developed knowledge systems will also be examined in order to identify further relevant components and interactions; a more extensive classification of fundamental elements—in terms of a widget "language"/library—will be defined, leading to an extension of ProET to match that "language". Also, prototyping with ProET is currently purely text-based. Extending the tool to also allow for a more visual form of prototyping—e.g., assembling prototype elements via drag & drop—is another interesting research issue, as this offers the chance to make the prototyping process per se much more intuitive.

Another open question is whether usability evaluation components can and should be directly integrated into ProET (e.g., integrating a tailored logging mechanism to track the "usage" of the prototype). A further idea is incorporating tailored usability guidelines/heuristics into the tool in the form of interactive questionnaires that can optionally be rendered together with the prototype, enabling its rather straightforward evaluation.

Also, the direct linking of the d3web toolkit (http://d3web.sourceforge.net/) to ProET is under way. d3web facilitates the development of deployable knowledge bases, supporting various problem-solving methods (e.g., heuristic rules or set-covering models). This coupling first enables the integration of deployable knowledge bases with ProET, permitting an easy investigation of which KS type and corresponding (interaction) design is suitable for a given knowledge base—e.g., one developed in the course of actual projects—or, more generally, for a specific knowledge representation. The long-term objective is to extend UI prototypes into productive knowledge systems with no or minimal additional effort.
Modeling of Diagnostic Guideline Knowledge in Semantic Wikis
Reinhard Hatko, Jochen Reutelshoefer, Joachim Baumeister, and Frank Puppe
Institute of Computer Science, University of Würzburg, Germany
{lastname}@informatik.uni-wuerzburg.de

Abstract
Knowledge acquisition for diagnostic knowledge systems is a complex and tedious task. In particular, the formalization of diagnostic guideline knowledge is challenging for the contributing domain specialists. In this paper, we introduce the formal representation language DiaFlux, which is simple and easy to use on the one hand and, on the other hand, allows for the definition of executable clinical protocols that can solve valuable tasks when executed in the clinical context. Further, we describe a wiki-driven development process using stepwise formalization and allowing for almost complete self-acquisition by the domain specialists. The applicability of the approach is demonstrated by a project developing a protocol for sepsis diagnosis and treatment in collaboration with clinicians. This paper is currently under submission at the Open Knowledge Models Workshop at EKAW 2010.
1 Introduction
In recent years, knowledge engineering research has been heavily influenced by the emergence of Web 2.0 applications, such as wikis, blogs, and tagging systems. They provide simplified access and a light-weight approach to knowledge acquisition. Furthermore, such systems usually allow for a distributed and (often) collaborative development process. One of the most popular examples is the wide-spread use of wikis as flexible knowledge management tools, both in personal life and in business environments. In contrast to standard web applications, the content of a wiki page can be created and modified by clicking an (often mandatory) edit button located on the web page. Due to the simple markup (less verbose than HTML), users can author the content easily. Wikipedia is certainly the most popular example, where informal world knowledge is created and updated in a wiki. Introducing the semantic interpretation of wikis, the development of Semantic Wikis [1] allows for a more formalized definition of the knowledge. Today, Semantic Wikis are mainly used for collaborative ontology development, providing a flexible, web-based interface to build semantic applications.

The main benefit of Semantic Wikis is the possibility to interweave different formalization types of knowledge in the same context. That way, ontological concept definitions are mixed with free text and images within the wiki articles. Such tacit knowledge often serves as documentation of the development process or provides additional information that is not representable in a more formal manner. We call this interweaving of implicit and formal elements of knowledge, and their interaction in the knowledge engineering process, the knowledge formalization continuum [2]. The knowledge formalization continuum emphasizes that usable knowledge ranges from very informal representations—such as text and images—to very explicit representations—such as logic formulas or consistency-based models. The metaphor frees domain specialists and knowledge engineers from committing to a particular knowledge formalization at an early stage of the development project and offers a versatile understanding of the formalization process. Present Semantic Wiki implementations serve as ontology engineering tools with some support for rules, thus covering a wide range of the knowledge formalization continuum.

In this paper, we introduce the Semantic Wiki KnowWE, which was designed to build decision-support systems, and we propose the graphical language DiaFlux for modeling clinical protocols. The contribution of this language is its simple application for developing decision-support systems, since it only provides a limited number of intuitive language elements. Due to its simplicity, it can be used by domain specialists, which eases its application in the knowledge engineering process. Despite its simplicity, a rich set of diagnostic elements required to build sophisticated (medical) knowledge bases can be integrated into the language. Furthermore, the language allows for the incorporation of less explicit knowledge elements when needed, and thus follows the ideas of the knowledge formalization continuum. To allow for comfortable development of DiaFlux models, we introduce a visual editor integrated into the Semantic Wiki KnowWE.
The rest of the paper is organized as follows: Section 2 briefly introduces the Semantic Wiki KnowWE and discusses the integration of strong problem-solving knowledge into the context of a Semantic Wiki. Some knowledge engineering aspects, such as the organization of a distributed knowledge base over a wiki, are also discussed. The procedural knowledge representation language DiaFlux is introduced in Section 3, together with its integration into the Semantic Wiki and the development process, including stepwise formalization. Currently, the approach is being evaluated through the development of a medical decision-support system; we describe the essentials of this case study in Section 4. The paper is summarized and concluded in Section 5, which also gives an outlook on future work.

2 KnowWE in a Nutshell
Wikis became famous as a successful means for knowledge aggregation in an evolutionary, 'self-organizing' way by possibly large and open communities, without detailed project management of contributions. Semantic Wikis extend the wiki approach by adding formal representations of the wiki content. This extension allows for two different perspectives on the use of Semantic Wikis [3]:
• Knowledge formalization for the wiki: Here, the informal content of the wiki is in the foreground. The formalization is used to help organize, navigate, and present the content, allowing for semantic navigation and search or the generation of new aggregated views on the content, e.g., by inline queries.
• Wiki for knowledge formalization: In this case, the formalized knowledge base is the goal of the application. The knowledge base is created by use of the well-known wiki authoring metaphor. One example is the use of a Semantic Wiki as a collaborative ontology engineering tool. The informal content is used here as description and documentation of the formal concepts and relations.

The wiki-based knowledge engineering approach described here, as well as the system introduced in this paper, clearly focuses on the latter perspective: the wiki is used as a tool for creating and maintaining formal concepts, formal relations, and informal knowledge containing descriptions as documentation for these. While many Semantic Wikis provide means to create and populate light-weight ontologies, the approach can be generalized to create any kind of formal knowledge base, e.g., for decision-support systems. The Semantic Wiki KnowWE provides methods to capture and execute problem-solving knowledge. To this end, a problem-solving layer has been included on top of the ontological layer, defining findings that describe the currently running problem-solving session. KnowWE is designed with only minimal modifications to the usual wiki look and feel, allowing for "backward compatible" use as a normal wiki. Figure 1 shows an article about Bad Ignition Timing as part of a car-diagnosis example wiki. It contains textual descriptions of the concept, which forms a solution of the diagnostic wiki knowledge base. Embedded in this informal content, it contains rules (1) defined by a specific rule markup, forming the formal knowledge that derives the concept Clogged Air Filter as a solution of the problem-solving session. Further, KnowWE provides means for testing the knowledge, e.g., by answering pop-ups (2) that are defined in the page content. The status of the derived solutions of the current problem-solving session is shown in the left panel (3).
At the top of the formal knowledge block, different icons provide additional features, such as starting a problem-solving session as a guided interview in an external dialog frame or downloading the generated knowledge base (4). The problem-solving knowledge (rules in this example) is entered in textual form using a specified markup in the standard wiki edit view; it is processed by the wiki engine and translated into an executable knowledge representation. Each time the content is edited, the formal knowledge sections are processed with the page save action, and the executable knowledge base is updated accordingly. Different knowledge formalization patterns (e.g., heuristic rules, decision trees, set-covering knowledge) are supported by the system and can be captured by the use of various markups [4].

Following the freedom-of-structuring principle proposed by wikis, the system does not impose any constraints on where formal knowledge should be defined. Objects and formal relations can be used on any wiki page at any location, using the markup defined by the current system settings. KnowWE also provides components to test the current version of a knowledge base. For automated testing of the knowledge base behavior, KnowWE allows for the definition of test cases, which can be executed after knowledge base modifications. For the creation of variants of a knowledge base, the system provides a flexible include mechanism that allows arbitrary knowledge elements to be included in a new page, forming a new compilation of the overall knowledge corpus (e.g., for fault diagnosis in deviating production series in technical domains).

Wiki-based Knowledge Formalization
The major strength of wikis is the generally low barrier to contribution due to their simple editing mechanism. However, the definition of a formal knowledge base using textual markups is a complicated task. Autonomous contribution by domain specialists can still be achieved using a stepwise formalization process. Employing the metaphor of the knowledge formalization continuum, informal knowledge describing the domain is inserted into the wiki first. This can be done by domain specialists without requiring special knowledge engineering experience. For this purpose, already existing documents can also be imported into the wiki as startup knowledge and refined manually. After short training sessions discussing and modifying example knowledge bases, the knowledge formalization task is started as an incremental process: the domain specialists, not yet completely familiar with the provided markups and formalisms, first formulate the knowledge that so far exists only as informal description in a pseudo-code style inspired by the markup. In cooperative sessions with the knowledge engineers, this knowledge is discussed and transformed into correct syntactical shape. This procedure allows for autonomous contributions by the domain specialists, even if the results are not yet completely formalized. The web-based collaborative access provided by wikis supports this evolutionary process.

3 DiaFlux - Modeling Clinical Care Processes in Semantic Wikis
This section first describes our application scenario and then gives a short overview of guideline models in the diagnostics domain. Following that, we introduce our representation language for clinical protocols, called DiaFlux.
3.1 Application Scenario
Clinical guidelines have shown their benefits by providing standardized treatment based on evidence-based medicine. Many textual guidelines are readily available and also shared through the internet, but rely on proper application by the clinician during the actual care process. While clinical guidelines are mostly textual documents, clinical protocols are an implementation of them, offering a more specific procedure for diagnosis and treatment in a given clinical context [5]. Much effort has been put into the development of formal models for computer-interpretable guidelines (CIGs). Clinical decision-support systems that execute CIGs support the clinician in decision-making at the point of care.

Figure 1: A wiki page of the car diagnosis example wiki containing formal and informal knowledge about the concept Bad Ignition Timing.

In the variety of CIG models, each has its own focus; e.g., GLIF [6] focuses on the shareability of guidelines between various institutions, while PROforma [7] focuses on assisting patient care through active decision support [8]. The work presented in this paper is conducted within the project "CliWE - Clinical Wiki Environments", funded by Drägerwerk AG & Co. KGaA, Lübeck, Germany, 2009-2011. We investigate languages, tools, and methodologies for collaboratively building CIGs by domain specialists themselves. The requirement concerning the language is the development of an explicit and executable representation of diagnostic knowledge for active decision-support systems. Furthermore, we create a development process for simple and effective knowledge acquisition by domain specialists. Finally, the completed knowledge bases will be exported into mixed-initiative systems that cooperate with the clinical user during the care process.

3.2 Modeling Clinical Processes
For the representation of the tasks that have to be carried out during the process of care, guideline languages employ different kinds of Task Network Models [9]. They describe decisions, actions, and constraints on their ordering in a guideline plan. Often, flowcharts are the underlying formalism to explicitly express control flow, at least at an abstract level. GLIF, for example, takes a multi-level approach with three levels of abstraction: a conceptual, a computable, and an implementable level. The conceptual level is modeled using flowcharts for interpretation by humans, but cannot be executed by decision-support systems. The formal specification is provided at the next lower level, the computable one. On the implementable level, a guideline can be tailored to a specific institution, e.g., by defining mappings to patient information systems.

For the specification of a clinical protocol, two kinds of knowledge have to be effectively combined, namely declarative and procedural knowledge. While the declarative part encompasses the facts and their relationships, the procedural one reflects the knowledge about how to perform a task, i.e., deciding which action to take next. In diagnostics, the declarative knowledge particularly consists of the terminology, i.e., findings, solutions, and sometimes also treatments, and their interrelations. The procedural knowledge for diagnostics in a given domain is responsible for deciding which action to perform next (e.g., in patient care: asking a question or carrying out a test).
Each of these actions has a cost (e.g., monetary cost or an associated risk) and a benefit (for establishing or excluding the currently considered solutions). Therefore, the choice of an appropriate sequence of actions is mandatory for efficient diagnosis and treatment.

We therefore propose a knowledge representation for clinical protocols called DiaFlux. It is designed to be a sufficiently expressive knowledge representation for executable clinical protocols, yet intuitive enough for self-acquisition by domain specialists.

3.3 Modeling with DiaFlux
This paper introduces the formalization of clinical protocols with DiaFlux. By combining well-known elements from flowcharts (i.e., nodes and edges) with the tasks that are carried out during diagnostic problem-solving, it allows for a uniform and intuitive acquisition of declarative and procedural knowledge. As known from flowcharts, the nodes represent actions that have to be executed, and the edges connecting those nodes define the order of execution. An edge can be guarded by a condition that controls the transition to the subsequent node. The nodes represent actions like performing a test or evaluating a diagnosis. The guards are attached to edges and control the process by checking the state of the declarative knowledge, e.g., the value of a finding or the evaluation of a solution. Clinical protocols that are formalized with DiaFlux are intuitively understandable and more easily maintainable, as the sequence of actions is made explicit. Furthermore, due to the semantics of an underlying application ontology, a protocol is directly executable.

Design Goals
When designing a formalism – especially one aimed at being used by non-computer-scientists – a trade-off has to be made between its expressiveness and its usability/understandability. In our approach we favor usability over expressiveness, offering a minimal set of primitives that is still expressive enough for the targeted scenario. Besides, the following goals were pursued during the design of DiaFlux:
1. Modularity: To ease the reuse of (parts of) formalized knowledge, DiaFlux models are intended to be reused in different contexts. The modularization also helps to improve the maintainability of the knowledge base.
2. Repetitive execution of subtasks: Online monitoring involves the continuous observation of sensory data to detect fault states and initiate corrective action. Therefore, particular actions need to be performed in an iterative manner.
3. Parallelism: Subtasks with no fixed order and no dependencies can be allocated to additionally spawned threads of control, thus allowing for their parallel execution. Expressing parallelism is especially necessary for mixed-initiative diagnosis, in which human- and machine-initiated examinations are carried out concurrently.
4. Testability: The evaluation of a knowledge base is an essential step prior to its productive use. We provide basic functionality for empirical testing and anomaly checks tailored to DiaFlux models.

Language Description
Wang et al. [10] studied several guideline representation models and identified common primitives for guideline creation. These were categorized as actions, decisions, and representations of the patient state. Actions denote specific clinical tasks like collecting data or clinical interventions. Decisions represent clinical decision making and guide the course of the care process. Patient states represent a patient's clinical status in the context of guideline application.
As DiaFlux models are based on flowcharts, we identified a minimal subset of node types that incorporates the necessary primitives to express a clinical protocol. To express the procedural aspect of the protocol, nodes can be connected by edges. There are no different types of edges, but edges can be labeled with different types of conditions. The actual type of a condition depends on the type of node the edge starts at. Conditions are used to evaluate the declarative knowledge with respect to the observed findings, e.g., the outcome of a given test or the status of a diagnosis. To obtain the semantics necessary for executability, we rely on an application ontology as an extension to the task ontology of diagnostic problem solving [11]. The application ontology defines the declarative knowledge consisting of findings and their ranges, solutions, and treatments. In the following, we enumerate and informally describe the different node types before giving a toy example of a DiaFlux model in the introduced car fault diagnosis domain:
• Start node: A start node does not imply an action itself, but is a pseudo-node pointing to the node that represents the first action to take. Multiple start nodes can be modeled to provide distinct entry points into one DiaFlux protocol.
• Test node: A test node represents an action for carrying out a single test upon activation of the node at runtime. This may trigger a question the user has to answer or data to be obtained automatically from sensors or a database. The acquired information refines the knowledge about the patient state.
• Solution node: Solution nodes are used to set the evaluation of a solution, based on the observed findings.
• Wait node: Upon reaching a wait node, the execution of the protocol is suspended until the given time has passed.
• Composed node: DiaFlux models can be hierarchically structured, as already defined models can be reused as modules, represented by a composed node. This fulfills the aforementioned goal of modularity.
• Exit node: An exit node terminates the execution of a DiaFlux model and returns the control flow to the superordinate model. To express different results of a model, several distinctly labeled exit nodes are supported.
• Comment node: For documentation of a protocol, comment nodes can be inserted at arbitrary positions. They can also be connected by edges and thus be used to create semi-formal guidelines. They do not represent an action and are ignored during execution.

Figure 2 shows a DiaFlux module for handling the problem area "Battery", which has been established by another, superordinate module (not shown in Figure 2). It is embedded into a wiki article containing further informal information about batteries. The execution of the module starts at the start node "Start" (1), which points to the node "Battery voltage" (2). This is a test node, which asks the user for the actual voltage of the battery. As "Battery voltage" is a question with a numerical range, the conditions that can be modeled on the outgoing edges are checks against disjoint intervals. In case the battery's voltage is high enough to start the car (> 12.5V), the fault may be a connection problem to the electrical system. So the user is asked whether the terminals are clean. In case they are, the problem has to be the automobile self-starter or its cabling. Then the execution reaches a solution node that evaluates the solution "Damaged Starter" as suspected (3).
An attached comment node gives a hint for further elaboration, pointing out that the cabling may also be the fault cause. With rusty terminals, a connection problem is likely. The instructions on how to clean the terminals form a self-contained module that can readily be reused in this protocol. Upon reaching the composed node "Clean Terminals" (4), the execution of the corresponding protocol is started and "Battery Check" is stalled until the subordinate one is finished. In this example, the module "Clean terminals" exclusively consists of instructions to the user, so the only possible outcome of this procedure is "Terminals cleaned" (represented by a corresponding exit node), which is the condition on the outgoing edge. After following the instructions and finishing the module, the exit node "Battery ok" (5) is activated, returning execution to the superordinate protocol (not shown in Figure 2).

Figure 2: A guideline for diagnosing a car battery, embedded into a wiki article containing further information. The pathway of the current testing session is highlighted in green.

In case the battery's voltage is too low to reliably start the engine (below 12.5V), different solutions can be established. If the voltage lies below 11.9V, the battery is exhaustively discharged and considered broken. Therefore, the solution "Battery damaged" is established and the execution ends by reaching the exit node "Fault detected". If the voltage is in the range [11.9V, 12.5V], it is probably too low for activating the self-starter, but the battery can be recharged. After establishing the solution "Flat Battery", the user is instructed to charge the battery via the node "Load battery" (this is basically modeled as a one-choice question with the single answer "ok"). Then the wait node (6) is activated, suspending the execution for 12 hours. After this period has passed, the execution of the protocol is resumed. The user is asked for the actual battery voltage again, which is now the battery's maximum voltage. If the voltage is still too low to start the car (< 12.5V), a "Damaged Battery" can again be established. If the voltage is sufficiently high, the taken exit node "Battery ok" indicates that there must be another cause for the fault. After returning to the superordinate protocol, further steps can be taken to find the cause of the fault, e.g., a damaged engine. The outgoing edges in the superordinate protocol starting at the composed node representing "Battery Check" decide how to proceed further, depending on the taken exit node.

Integration in KnowWE
We created an implementation of DiaFlux for the knowledge-based system d3web [12]. DiaFlux offers the possibility to model and execute protocols that employ the declarative and inferential expressiveness provided by d3web. An AJAX-based editor for DiaFlux is integrated into the Semantic Wiki KnowWE (cf. Figure 3), using its plugin mechanism [13]. On the one hand, the DiaFlux editor is able to reuse ontological concepts that are readily available in the wiki's knowledge base; these can simply be dragged into the flowchart. Depending on the type of object (finding, solution, DiaFlux model), a node of the adequate type is created. On the other hand, the application ontology can be extended by creating new concepts from within the editor with an easy-to-use wizard. The model's source code is encoded in XML, integrated into the corresponding wiki article, and saved and versioned together with it.
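To make the node-and-edge structure described above more concrete, the following minimal Java sketch models a flowchart with typed nodes and guarded edges. It is an illustration only: the class and enum names are assumptions, not the actual d3web/DiaFlux classes, and the session abstraction is deliberately simplified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Illustrative sketch only: a minimal flowchart data model with the node types
 * described above (start, test, solution, wait, composed, exit, comment) and
 * guarded edges. Names are assumptions, not the actual d3web/DiaFlux API.
 */
public class FlowchartSketch {

    enum NodeType { START, TEST, SOLUTION, WAIT, COMPOSED, EXIT, COMMENT }

    static class Node {
        final NodeType type;
        final String label;                      // e.g. "Battery voltage"
        final List<Edge> outgoing = new ArrayList<>();

        Node(NodeType type, String label) {
            this.type = type;
            this.label = label;
        }
    }

    static class Edge {
        final Node target;
        final Predicate<Session> guard;          // condition on the declarative state

        Edge(Node target, Predicate<Session> guard) {
            this.target = target;
            this.guard = guard;
        }
    }

    /** Stand-in for the declarative state (observed findings) of a session. */
    interface Session {
        double numericValue(String question);    // e.g. numericValue("Battery voltage")
    }

    /** Follows the first outgoing edge whose guard holds for the current findings. */
    static Node step(Node current, Session session) {
        for (Edge edge : current.outgoing) {
            if (edge.guard.test(session)) {
                return edge.target;
            }
        }
        return current;                          // no guard satisfied: wait for further input
    }
}
```

In this sketch, the battery example would wire a TEST node "Battery voltage" to a SOLUTION node via an edge guarded by, e.g., session -> session.numericValue("Battery voltage") < 11.9.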
This allows for further documentation of the protocol by tacit knowledge in the article. When the article is displayed in a web browser, the model visualization is rendered instead of its XML source code.

A related wiki environment for the collaborative creation of clinical guidelines is the Modelling Wiki (MoKi) [14], based on Semantic MediaWiki [15]. Originally it was designed for the creation of enterprise models using a visual editor, but it has also been used in the Oncocure project [16] to acquire clinical protocols for breast cancer treatment. For that purpose, templates were defined within the wiki and their content was later exported into skeletal Asbru plans [17]. Despite MoKi's visual editing capabilities for business processes, these were not employed to graphically model guidelines. Furthermore, the created Asbru plans are currently not executable within the wiki.

Figure 3: The main module of the sepsis diagnosis and treatment protocol, opened in the AJAX-based editor.

Development Process
For the development of DiaFlux models we propose the idea of the knowledge formalization continuum [2]: At first, informal information can be collected in wiki articles, e.g., about the goals of a protocol. In the next step, a first semi-formal flowchart can be created using only comment, start, and exit nodes and connecting edges. At this stage of formalization, the flowcharts cannot be executed automatically, but only "manually". For testing purposes, the user can step through the flowchart by clicking on the outgoing edge of the active node on which he wants to continue the pathway. The taken pathway is highlighted for easier tracking; this is especially useful when parallelism or hierarchically structured protocols are involved. The last step is the formalization into a DiaFlux model and the creation of the application ontology, resulting in a fully formalized and executable knowledge base. By following this process of gradual refinement, the entry barrier for domain specialists is quite low, while knowledge acquisition can start from the very beginning. Graphical modeling languages as a mediator to extract knowledge are also used in the IT-Socket of the research project plugIT [18]. Compared to our process of gradual refinement within the same modeling language, the IT-Socket employs semi-formal graphical models created by domain specialists that are later formalized on a different level of abstraction.

Development Tools
Collaborative development requires tracking the changes of all participants. Therefore, a frequent task is to compare different versions of a wiki article. For this purpose, wikis generally provide a textual diff comparing two versions of an article. As a diff of the XML source code is not very helpful for comparing a visual artifact like a flowchart, a more understandable diff is provided. On the one hand, a textual summary of the added, removed, and changed nodes and edges is generated. On the other hand, the previous and the current version of the DiaFlux model are shown next to each other, highlighting the changes in different colors for easy comparison: e.g., removed items are red in the previous version, added items are green in the current one, and changed items are highlighted in both versions.

After creating a knowledge base in KnowWE, a test session can be started directly from the wiki article containing it. When DiaFlux models are used, the current state of the protocol can be observed throughout the session.
The traversed pathway through the flowchart is highlighted in a similar manner as in the visual diff (cf. Figure 2). This immediate feedback considerably eases the interactive testing of the knowledge base.

4 Case Study
In the context of the project "CliWE" we used a prototype of the clinical wiki environment for the development of a protocol covering the diagnosis and therapy of sepsis. Sepsis is a syndrome of systemic inflammation of the whole body. This critical illness has a high mortality (30 to 60%), and there are two main problems in sepsis therapy: first, it is essential to recognize that a patient fulfills the sepsis criteria, and second, once sepsis is diagnosed, a complex medical therapy has to be initiated quickly. Today, so-called patient data management systems are available in many intensive care units, making medical data electronically available. In this context, a clinical decision-support system may be a reasonable solution for the practical problems outlined above, monitoring all patients for sepsis and supporting the physician in the initiation of sepsis treatment.

Figure 4: An early semi-formal version of the sepsis protocol.

The knowledge base was developed in accordance with the official guideline of the German Sepsis Society [19], a textual guideline of about 80 pages describing the prevention, diagnosis, and therapy of sepsis. Our formalization of the guideline so far contains the diagnostics and parts of the therapy, together with some common tasks for patient admission (cf. Figure 3). At the moment it contains about 50 nodes in eight modules with several possible pathways, depending on how exactly the diagnosis is established and on the course of the therapy. The upper part of the main DiaFlux model contains knowledge about the decision making, and the lower part contains knowledge about the treatment.

The diagnosis task involves the assessment of up to eight clinical parameters (conducted in the modules "Septic parameters" and "Extended septic parameters") and an established or suspected infection. The monitoring is repeated in different cycles until sepsis can be established, depending on which parameters are acquired and on their evaluation. If there is enough evidence to support a suspected sepsis, a warning to the clinician is generated. If the clinician agrees with the conclusion, the diagnosis "Sepsis" is established and instructions for starting the therapy are given. The treatment of sepsis consists of the three bundles causal therapy (treating the cause of the infection), supportive therapy (stabilizing the patient's circulation), and adjunctive therapy (supporting the fight against the infection). These bundles are modeled as self-contained modules and reused as composed nodes in the main module.

Experiences
The knowledge acquisition mainly took place in two workshops of approximately six hours each, involving two domain experts. The DiaFlux editor was handled by a knowledge engineer, who entered the knowledge artifacts provided by the domain specialists. The remaining participants followed the authoring process on a projector. During the first session we followed the idea of the knowledge formalization continuum and started with textual descriptions of most modules. As a second step, we created semi-formal flowcharts giving an outline of the protocol, as exemplified in Figure 4.
Next, we started to further formalize these flowcharts into executable DiaFlux models and to create the corresponding declarative knowledge. The second session began with the acquisition of test cases of typical sepsis patients. As these were only informally entered into a wiki article and not yet executable, we stepped through the model manually by highlighting the correct pathway. The inconsistencies found this way were corrected during the second half of the session, together with further elaboration of the knowledge base. In a third session of about one hour, one of the experts created a small module by himself, while being observed by a knowledge engineer. The expert shared his screen using internet screen-sharing software and was supported in formalizing the knowledge and in using the DiaFlux editor.

Overall, the wiki-based approach showed its applicability and usefulness, as the combination of formal and informal knowledge and its gradual refinement was used intensely during the acquisition of the protocol and the test cases. Further, the developed knowledge base was accessible to all participants immediately after the workshops, as the development took place in a password-protected wiki that can be accessed over the internet. So far, the knowledge acquisition was conducted in workshops involving domain experts and knowledge engineers. After the initial workshops and the successful tele-knowledge-acquisition session, we are confident that we can proceed with further workshops that require only minimal support by the knowledge engineers.

5 Conclusion
This paper presented work in the context of the project "CliWE - Clinical Wiki Environments" on the collaborative development and evolution of clinical decision-support systems. We introduced a language that can incorporate declarative and procedural diagnostic knowledge for modeling executable clinical protocols. Its main focus is simplicity of use by domain specialists. DiaFlux is integrated into the Semantic Wiki KnowWE to support collaborative development by a community of experts. The case study demonstrated the applicability and benefits of the approach during the development of a clinical protocol for sepsis diagnosis and treatment. Due to the wiki-based approach, the knowledge can evolve easily. It is accessible without depending on specialized software, as long as an internet connection is available. Furthermore, domain specialists can start contributing almost instantly. Formalization of the knowledge can then happen at a later time, after they have familiarized themselves with the semantics.

As next steps we plan the integration of refactoring capabilities into the editor, for easier evolution of DiaFlux models. We will also enhance the tool support for gradual formalization. As there rarely is only one single opinion in medicine, we will support different "medical schools" represented by the contributing experts through the possibility to engineer variants of DiaFlux models. In the future, we are planning the collaborative development by a community of experts, connected by KnowWE.

References
[1] Schaffert, S., Bry, F., Baumeister, J., Kiesel, M.: Semantic wikis. IEEE Software 25(4) (2008) 8–11
[2] Baumeister, J., Reutelshoefer, J., Puppe, F.: Continuous knowledge engineering with semantic wikis. In: CMS'09: Proceedings of the 7th Conference on Computer Methods and Systems (Knowledge Engineering and Intelligent Systems). (2009) 163–168
[3] Buffa, M., Gandon, F., Ereteo, G., Sander, P., Faron, C.: SweetWiki: A semantic wiki.
Web Semantics 8(1) (2008) 84–97
[4] Baumeister, J., Reutelshoefer, J., Puppe, F.: Markups for knowledge wikis. In: SAAKM'07: Proceedings of the Semantic Authoring, Annotation and Knowledge Markup Workshop, Whistler, Canada (2007) 7–14
[5] Hommersom, A., Groot, P., Lucas, P., Marcos, M., Martínez-Salvador, B.: A constraint-based approach to medical guidelines and protocols. In Teije, A.t., Miksch, S., Lucas, P., eds.: Computer-based Medical Guidelines and Protocols: A Primer and Current Trends. Volume 139 of Studies in Health Technology and Informatics. IOS Press (2008) 213–222
[6] Boxwala, A.A., Peleg, M., Tu, S., Ogunyemi, O., Zeng, Q.T., Wang, D., Patel, V.L., Greenes, R.A., Shortliffe, E.H.: GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines. J. of Biomedical Informatics 37(3) (2004) 147–161
[7] Fox, J., Johns, N., Rahmanzadeh, A.: Disseminating medical knowledge: the PROforma approach. Artificial Intelligence in Medicine 14(1-2) (1998) 157–182. Selected Papers from AIME '97
[8] de Clercq, P., Kaiser, K., Hasman, A.: Computer-interpretable guideline formalisms. In ten Teije, A., Miksch, S., Lucas, P., eds.: Computer-based Medical Guidelines and Protocols: A Primer and Current Trends. IOS Press, Amsterdam, The Netherlands (2008) 22–43
[9] Peleg, M., Tu, S., Bury, J., Ciccarese, P., Fox, J., Greenes, R.A., Miksch, S., Quaglini, S., Seyfang, A., Shortliffe, E.H., Stefanelli, M., et al.: Comparing computer-interpretable guideline models: A case-study approach. JAMIA 10 (2003)
[10] Wang, D., Peleg, M., Tu, S., Boxwala, A., Greenes, R., Patel, V., Shortliffe, E.: Representation primitives, process models and patient data in computer-interpretable clinical practice guidelines: A literature review of guideline representation models. International Journal of Medical Informatics 68(1-3) (2002) 59–70
[11] Baumeister, J., Reutelshoefer, J., Puppe, F.: KnowWE: A semantic wiki for knowledge engineering. Applied Intelligence (2010)
[12] Baumeister, J., et al.: The knowledge modeling environment d3web.KnowME. Open source at: http://d3web.sourceforge.net (2008)
[13] Reutelshoefer, J., Lemmerich, F., Haupt, F., Baumeister, J.: An extensible semantic wiki architecture. In: SemWiki'09: Fourth Workshop on Semantic Wikis – The Semantic Wiki Web (CEUR proceedings 464). (2009)
[14] Ghidini, C., Kump, B., Lindstaedt, S.N., Mahbub, N., Pammer, V., Rospocher, M., Serafini, L.: MoKi: The enterprise modelling wiki. In: ESWC'09: The Semantic Web: Research and Applications. Volume 5554 of LNCS., Springer (2009) 831–835
[15] Krötzsch, M., Vrandečić, D., Völkel, M.: Semantic MediaWiki. In: ISWC'06: Proceedings of the 5th International Semantic Web Conference, LNAI 4273, Berlin, Springer (2006) 935–942
[16] Eccher, C., Rospocher, M., Seyfang, A., Ferro, A., Miksch, S.: Modeling clinical protocols using Semantic MediaWiki: the case of the Oncocure project. In: K4HelP: ECAI 2008 Workshop on Knowledge Management for Healthcare Processes, University of Patras (2008) 20–24
[17] Miksch, S., Shahar, Y., Johnson, P.: Asbru: A task-specific, intention-based, and time-oriented language for representing skeletal plans. Open University, UK (1997) 9–1
[18] Woitsch, R., Utz, W.: The IT-Socket: Model-based business and IT alignment. In Weghorn, H., Isaías, P.T., eds.: IADIS AC (1), IADIS Press (2009) 141–148
[19] German Sepsis-Society: Sepsis guideline.
http://www.sepsis-gesellschaft.de/DSG/Englisch

Linked Data Games: Simulating Human Association with Linked Data
Jörn Hees, Thomas Roth-Berghofer, and Andreas Dengel
Knowledge Management Department, German Research Center for Artificial Intelligence DFKI GmbH, and Knowledge-Based Systems Group, University of Kaiserslautern
{firstname.lastname}@dfki.uni-kl.de

Abstract
Teaching machines to understand human communication is one of the central goals of artificial intelligence. Psychological research indicates that human associations are an essential requirement for understanding human communication. In this paper the hypothesis is presented that simulating human associations with the help of Linked Data could improve the text understanding capabilities of machines. To investigate whether human associations can be simulated with Linked Data, two preliminary problems are identified: (i) a reasonable ground truth for human associations is lacking, and (ii) human associations have different strengths, while Linked Data treats all triples equally and does not provide edge weights. To overcome these problems, two ideas for web games in accordance with Luis von Ahn's Games with a Purpose are proposed, trying to turn the tedious acquisition processes into fun games. The resulting datasets are then to be used for quantitative comparisons of human associations and Linked Data.

1 Introduction
Since its introduction in 2001, the Semantic Web has gained much attention. In recent years, especially the Linked Open Data (LOD) project (http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) has contributed many large, interlinked, and publicly accessible datasets, generating one of the world's largest distributed knowledge bases. The accumulated amount of Linked Data can already be used to answer astonishingly complex questions (e.g., compiling a list of all musicians who were born in Berlin before 1900) or to provide additional information for selected concepts on a website (e.g., providing a short abstract and a thumbnail when hovering over the name Barack Obama).

In many ways Linked Data is reminiscent of spreading activation semantic networks [Collins and Loftus, 1975]. Spreading activation semantic networks are successfully used in psychology to model human associations (e.g., thinking of Barack Obama, one will most likely also think of USA). Human associations are an important processing ability of our memory, allowing us to retrieve related thoughts, traverse from one thought to another, and thereby facilitate our way of thinking. They are also crucial for our understanding of everyday communication [Gerrig and Zimbardo, 2010, pp. 240ff], as they help us to build a context and resolve ambiguities. Hence it seems plausible that simulating such associations could help to improve the text understanding capabilities of machines.

This work investigates the question if and how it is possible to simulate human associations with the help of Linked Data. A first analysis identifies two problems: Currently, a quantitative comparison between Linked Data and human associations is impossible due to the lack of a reasonably large ground truth of human associations. Such a ground truth would consist of a large number of association pairs (e.g., (Barack Obama, President of the US)) collected from many different test persons. Collecting such a ground truth would allow us to answer questions such as: how large is the overlap between Linked Data and human associations?
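As a purely illustrative sketch (the data structures and example pairs below are assumptions, not the output of the proposed games), such an overlap could be computed by treating both the collected association pairs and the Linked Data triples, stripped of their edge labels, as sets of concept pairs:

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch only: measures the overlap between a set of human
 * association pairs and a set of Linked Data triples reduced to unlabeled
 * (subject, object) pairs. The example data is made up.
 */
public class AssociationOverlap {

    record Pair(String a, String b) {
        /** Associations are treated as undirected, so normalize the order. */
        static Pair of(String x, String y) {
            return x.compareTo(y) <= 0 ? new Pair(x, y) : new Pair(y, x);
        }
    }

    public static void main(String[] args) {
        Set<Pair> humanAssociations = new HashSet<>();
        humanAssociations.add(Pair.of("Barack Obama", "USA"));
        humanAssociations.add(Pair.of("Barack Obama", "Honolulu"));

        // Linked Data triples with their predicates (edge labels) ignored.
        Set<Pair> linkedDataPairs = new HashSet<>();
        linkedDataPairs.add(Pair.of("Barack Obama", "Honolulu"));   // e.g. dbo:birthPlace
        linkedDataPairs.add(Pair.of("Barack Obama", "Politician")); // e.g. rdf:type

        Set<Pair> overlap = new HashSet<>(humanAssociations);
        overlap.retainAll(linkedDataPairs);

        System.out.printf("overlap: %d of %d human associations (%.0f%%)%n",
                overlap.size(), humanAssociations.size(),
                100.0 * overlap.size() / humanAssociations.size());
    }
}
```

The interesting part is of course not the set intersection itself but obtaining a human-association set that is large enough to make such a comparison meaningful.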
Nevertheless, due to the desired size, the acquisition would be infeasible with traditional approaches, such as paying test persons to record their associations.

The second problem is that human associations have different strengths, while Linked Data treats all triples equally and currently does not provide edge weights. Solving this problem would, for example, allow us to ask a machine to show us only the 20 strongest associations related to a resource, which in turn could be used to narrow down search spaces, use spreading activation algorithms in a meaningful way, or rank results by association strength. While several approaches try to rate triples by heuristics, none of them has been compared to a dataset of human association strengths. However, the acquisition of such a dataset would require us to assign weights to a very large number of Linked Data triples, which again would be infeasible with traditional approaches.

As both datasets need to be collected in order to investigate whether it is possible to simulate human associations with Linked Data, two ideas for games in accordance with Luis von Ahn's Games with a Purpose [von Ahn and Dabbish, 2008] are proposed, turning the tedious process of entering associations or ratings into fun games.

The remainder of this paper is structured as follows: First, the current state of the art is presented. Then a brief introduction to human communication is given, explaining the important role played by associations in the language understanding process. In the following section, human associations are conceptually compared to Linked Data, resulting in the two problems outlined above. For each of these problems a game idea is proposed, preceding the final conclusion and outlook.

2 State of the art
In recent years the research community has devoted much attention to the Semantic Web [Berners-Lee et al., 2001], which proposed the vision of sharing information on the web not only with other humans, but also transforming it into a meaningful form for machines. Two of the most famous outcomes of these developments are the Resource Description Framework (RDF, http://www.w3.org/TR/rdf-primer/) and the Web Ontology Language (OWL, http://www.w3.org/TR/owl-ref/). Still, as manually providing information in these languages is quite cumbersome and seldom results in an immediate benefit for the authors, a few years went by without a reasonable amount of RDF being published on the web. Realizing this, a kind of grassroots development began to publish data that was already available in several centralized, corporate, or governmental databases. Guidelines were quickly developed for publishing such data and interlinking it with other datasets. With the ongoing help of volunteers and famous proponents, the so-called Linked Open Data project was born, nowadays forming the so-called Linked Data Cloud of interlinked datasets, including over 13.1 billion RDF triples as of November 2009. The content of this cloud, often called Linked Data, constitutes the world's largest distributed knowledge base and can be seen as a first challenge for the vision of the Semantic Web.

As more and more datasets were integrated into this Linked Data Cloud, certain centralized points evolved, one of them being DBpedia (http://dbpedia.org). The DBpedia team tries to automatically extract structured information from Wikipedia articles. In contrast to many other datasets, DBpedia represents a great deal of knowledge across a large number of domains, which makes it very interesting for tying domains together.
At the same time, due to the automatic nature of the extraction, DBpedia also introduces a lot of errors into the Linked Data Cloud.

Long before the Semantic Web started to evolve, psychological research had already been very active in the field of how human beings understand language. While the next sections go into more detail on spreading activation semantic networks [Collins and Loftus, 1975], it shall be mentioned here that they belong to the human semantic memory, which is a part of the so-called explicit memory [Baddeley et al., 2009, pp. 113–121].

As mentioned in the introduction, in this paper two games are proposed that try to turn the otherwise infeasible work into fun games. This is motivated by Luis von Ahn's Games With A Purpose (http://www.gwap.com), which are part of a field called Human Computation. After the big success of the famous ESP Game [von Ahn and Dabbish, 2004], which turns the tedious process of labeling images into a fun game, and of further games, a summary of design principles for Games With A Purpose [von Ahn and Dabbish, 2008] was published. The proposed game ideas are especially related to the ESP Game, Verbosity [von Ahn et al., 2006], Matchin [Hacker and von Ahn, 2009], and OntoGame [Siorpaes and Hepp, 2007]. Verbosity is a game which turns the widely known Taboo game into a game collecting common-sense facts. The game is of an asymmetric type: there is a describer who gets a word that she has to describe to the guesser. The guesser can see the describer's output and guesses her input. In order to prevent cheating and to direct the collection of common-sense facts, the describer cannot enter text freely, but has prepared snippets that can be completed, such as "is a", "has a", "is the opposite of". If human associations were extracted from the facts collected with this game, they would be strongly biased towards these snippets. Besides this, the results of the game do not seem to be published. Matchin is a game which presents pairs of images to the players and asks them which one their partner will prefer. The collected set of relative votings is then used to globally rank the pictures. OntoGame can be seen as the first game of this kind applied to Linked Data. Nevertheless, it solves a very different task than the games proposed later on, namely finding out whether a Linked Data resource is an instance or a class and then trying to file it into a taxonomic structure.

3 Human Communication
Human communication is the process of transporting information from one human being to another (we focus on written or spoken communication here). In such communication we can distinguish between symbols, THOUGHTS, and referents, as visualized in the so-called semiotic triangle (Figure 1). This distinction allows us to see one of the main characteristics of human communication: thoughts heavily depend on the respective person, and we are not able to exchange thoughts directly. A THOUGHT (e.g., P1:B.OBAMA) is someone's mental representation of some referent (e.g., Barack Obama, the one person with that name, currently being president). Instead of exchanging a thought directly, we are only able to exchange a symbol for the thought in written or spoken form (e.g., the two words Barack Obama).

Figure 1: Semiotic Triangle after Ogden & Richards (left) and an example (right), relating the symbol "Barack Obama" to the thought p1:B.Obama and its referent.

As a speaker or writer, we then hope that the listener or reader of such a symbol finds an own thought that is sufficiently similar to our thought (see Figure 2, P2:B.O.), or creates a new thought for what we described.
As a speaker or writer we then hope that the listener or reader of such a symbol finds an own thought that is sufficiently similar to our thought (see Figure 2, P2:B.O.), or creates a new thought for what we described. Communication is not limited to the exchange of sin- gle thoughts, but is all about exchanging information (i.e., the connections between thoughts). As we can not ex- change thoughts directly, we can only exchange informa- tion by a from of symbolic indirection. From a Semantic Web point of view, information can be expressed as sim- ple statements. Each statement essentially is a simple sen- tence in the form of (subject, predicate, object)-triples, but instead of the usual symbolic form of a sentence now sub- ject, predicate and object are thoughts. The exchange of the triple (P1:B.OBAMA, P1:BIRTHPLACE, P1:HONOLULU) from person p1 to person p2 via the described symbolic in- direction is visualized in Figure 2. Note that the thoughts of p1 and p2 are disjoint as they are in different brains, but hopefully they are similar enough, so that p2 under- stands p1. Also note that P2:B.O. is more detailed than P1:B.OBAMA, as p2 also knows a more specific name. From this simple example one can identify two problems that can occur in our communication: • A thought can be referred to by several symbols (e.g., Barack Hussein Obama II, Obama, President of the 6We focus on written or spoken communication here. 256 p1:B.Obama p2:B.O. Barack Obama Barack Hussein Obama II p1:Honolulu p2:Honolulu Honolulu Honolulu was born in was born in Barack Obama Barack Obama was born in Honolulu. p1:birthPlace p2:bP p1 p2 Legend: : Thought : Symbol : Communication : Real world thing / Referent Figure 2: Human to human communication United States, President Obama all could refer to one single thought P2:B.O.). • A symbol can refer to several thoughts, also known as lexical ambiguity (Apple might refer to APPLE (the fruit) or APPLEINC (the company)). 3.1 Human associations Now, when we read or hear some symbol (e.g., Barack Obama) we instantly connect it with its thought(s) (e.g., P1:B.OBAMA), but also, we somehow remember re- lated thoughts (e.g., P1:PRESIDENTUSA, P1:USA, . . . , P1:HONOLULU), called associations. Associations can be thought of as a graph of thoughts or a so called spreading activation semantic network [Collins and Loftus, 1975], in which each thought is a node and associations between these nodes are edges. An example for such a semantic network is depicted in Figure 3 and shows some possible associations of P1:B.OBAMA. The length of an edge represents the strength of the association (e.g., P1:B.OBAMA’s association to P1:PRESIDENTUSA is stronger than to P1:HONOLULU). 3.2 Context It is argued that such associations are key requirements for our daily communication [Gerrig and Zimbardo, 2010, pp. 240ff]. Whenever we read or hear a symbol, our brain looks up its connected thought(s). These thoughts and their associations (in the following called associations of the cor- responding symbol) are used to generate and adjust a con- p1:B.Obama p1:PresidentUSA p1:Democrats p1:Politician p1:Honolulu p1:MichelleObama p1:VicePresidentUSA p1:USA p1:Hawaii p1:JoeBiden Figure 3: Semantic Network text of thoughts that are related to the current communi- cation sequence. The associations of a symbol may either support the current context or oppose it. 
In the first case our associations reinforce the context and our confidence that we understood what the speaker was telling us and that we are thinking about similar thoughts rises. In the lat- ter case we have an indication that our lookup of the sym- bol’s thought was wrong and we struggle to find a different meaning for what we have heard so far (e.g., in the sen- tence “Last year the pen was abandoned as it was too dirty for the animals to live in.” [Gerrig and Zimbardo, 2010, p. 241] one would first assume that “pen” was a writing instrument, as it does most of the time (so called biased ambiguity), but when reading about animals living in it we suddenly realize that it refers to an enclosure). Such a context, which emerged from our ongoing inte- gration of associations, can be seen as our condensed set of thoughts related to the current communication. We can also think of this context as a working space, a drastically reduced search space for things that we expect to be related to what the speaker / writer expressed, instead of consider- ing every thought we have in our mind to be potentially related. The context hence allows us to do much faster lookups (e.g., for symbols of thoughts which until now only were mentioned implicitly or for the integration of new in- formation into our memory) and gives us the ability to re- solve ambiguities. 4 Comparing Linked Data with human associations Linked Data and spreading activation semantic networks used to model human associations are very similar. Both can be seen as network structures with thoughts as nodes and edges in between, both of huge dimensions. In contrast to human communication, “thoughts” in Linked Data are meant to be exchanged directly between machines, which means without the symbolic indirection that we humans have to use. In contrast to the human brain the evolving knowledge base is thus distributed and inter- linked, which is at the same time one of its largest advan- 257 tages and disadvantages. For example it allows continu- ing growth and allows machines to exchange information without reasoning about whether their symbols mean the same things, but at the same time this freedom introduced many errors into Linked Data, as anyone can issue state- ments about anything or as people use properties in differ- ent ways (cf. the use of owl:sameAs [Halpin and Hayes, 2010]). Another difference between Linked Data and human as- sociations is that Linked Data has strongly typed edges, which identify the kind of relation between two concepts very precisely, while human associations do not. Usually it takes us much longer to find a name for the associa- tion of two thoughts than it takes us to name the associated thought. Furthermore, human associations do not seem to be limited to a certain type of relation between thoughts, even though it would be interesting to investigate which of these properties we use most frequently. To be able to investigate questions as the one above, and more general to perform any quantitative comparison be- tween Linked Data and human associations, a large ground truth of human associations is needed, which currently does not exist. Except from the need for such a ground truth, until now only differences were mentioned, which render Linked Data more specific than human associations and hence would allow to easily simulate human associations with Linked Data by just ignoring details such as edge labels. 
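To keep the comparison concrete, the spreading activation semantic network of Section 3.1 can be written down as a small weighted graph. The following Python sketch is purely illustrative: the example thoughts, the edge weights, the decay factor and the single propagation step are assumptions, not data from an actual association study.

# Minimal sketch of a spreading activation semantic network: nodes are
# thoughts, weighted edges are associations (higher weight = stronger).
# The nodes and weights below are illustrative assumptions only.
ASSOCIATIONS = {
    "B.Obama":      {"PresidentUSA": 0.9, "USA": 0.8, "MichelleObama": 0.7,
                     "Democrats": 0.6, "Honolulu": 0.3},
    "PresidentUSA": {"B.Obama": 0.9, "USA": 0.8, "VicePresidentUSA": 0.5},
    "Honolulu":     {"Hawaii": 0.9, "B.Obama": 0.3},
}

def strongest_associations(thought, k=3):
    """Return the k strongest associations of a thought."""
    neighbours = ASSOCIATIONS.get(thought, {})
    return sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)[:k]

def spread_activation(seed_thoughts, decay=0.5):
    """One propagation step: activation flows from the seed thoughts to their
    neighbours, attenuated by edge weight and a decay factor. The resulting
    scores form a 'context' of thoughts expected to be relevant."""
    context = {t: 1.0 for t in seed_thoughts}
    for thought, activation in list(context.items()):
        for neighbour, weight in ASSOCIATIONS.get(thought, {}).items():
            context[neighbour] = max(context.get(neighbour, 0.0),
                                     activation * weight * decay)
    return context

print(strongest_associations("B.Obama"))
print(spread_activation({"B.Obama", "Honolulu"}))

The context returned by such a propagation step is exactly the reduced search space described in Section 3.2.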
Still there exists one key feature of human associations, which currently is not part of Linked Data: association strengths. Even though heavily dependent on the person, the current context and task, nearly all humans will agree that they generally associate Barack Obama stronger with USA than with Honolulu. In contrast to this, the facts in the Linked Data cloud are facts in a logical sense. They are as- sertions, all of the same “truth”, none being more valuable than another. A collection of such association strengths would allow us to ask machines not only to give us all information about an instance (e.g., all 600 triples for Barack Obama), but also to rank this information by association strength (e.g., only presenting the top 10 of these to an end user) and thus, to constrain the number of results. With regard to under- standing human texts, this would allow us to propagate ac- tivation from thoughts whose symbols directly occur in the text to thoughts occurring only implicitly in a human like way. This in turn, would enable us to narrow down our search space from the whole Linked Data graph to only those thoughts associated with the current communication sequence by an “average human”. Another immediate ben- efit from annotating Linked Data triples with an associa- tion strength is a kind of feedback for automated extraction processes such as the one underlying DBpedia. One could investigate, which extraction rules yield high and which ones yield low association strengths, possibly driving an improvement process. Besides these immediate uses, such a collection of asso- ciation strengths would also allow us to test whether current common heuristics truly model how we associate thoughts and if they do, they could be used to bootstrap the acquisi- tion of associations strengths. Examples for such heuristics include word co-occurrences on websites or graph intrinsic features such as page rank, betweenness and the like, try- ing to model how much activation flows from one thought to another. For further research it is thus proposed to generate both: a human association ground truth as described earlier and a collection of Linked Data triples rated with their associa- tion strengths. 5 Linked Data Games In order for both datasets to be of any use, their size needs to cover a sufficiently large amount of human associations / Linked Data. The larger the amount of data collected, the more meaningful later quantitative comparisons will be. At the same time all data collected (e.g., associations of Barack Obama or whether USA is stronger associated to him than Honolulu), is highly subjective, resulting in a large amount of input from different persons to be needed in order to arrive at a collectively agreed association / rat- ing. Additionally entering a large amount of associations or ratings is an extremely tedious task. The former constraints render the application of tradi- tional approaches to generate a ground truth, such as pay- ing test persons to record or write down their associations or ratings, infeasible. Nevertheless, there exists a novel ap- proach called Human Computation [von Ahn and Dabbish, 2008] that might be suitable to solve this acquisition prob- lem. The approach suggests to turn problems which are difficult to solve for machines into fun games with atomic decisions which are easily answerable for us humans. In the following for each of the needed datasets an idea is presented on how the individual problem could be turned into such a game. 
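Before turning to the game ideas, the heuristics mentioned above can be made concrete. The sketch below queries DBpedia for the triples of one resource and orders them by a crude popularity proxy (the number of incoming links of the object). The endpoint, the heuristic, the result limit and the use of the third-party requests package are all assumptions; it is precisely such guesses that the proposed datasets would allow to validate or refute.

# Hedged sketch: fetch the triples of one DBpedia resource and rank them by
# a crude popularity proxy (how many other resources link to the object).
# Endpoint, heuristic and limit are assumptions; counting incoming links can
# be slow or time out on a public endpoint. Requires the 'requests' package.
import requests

ENDPOINT = "http://dbpedia.org/sparql"

def sparql(query):
    response = requests.get(ENDPOINT,
                            params={"query": query,
                                    "format": "application/sparql-results+json"})
    response.raise_for_status()
    return response.json()["results"]["bindings"]

def ranked_triples(resource_iri, k=10):
    query = """
        SELECT ?p ?o (COUNT(?in) AS ?popularity) WHERE {
          <%s> ?p ?o .
          OPTIONAL { ?in ?anyp ?o . }
        } GROUP BY ?p ?o ORDER BY DESC(?popularity) LIMIT %d
    """ % (resource_iri, k)
    return [(b["p"]["value"], b["o"]["value"], b["popularity"]["value"])
            for b in sparql(query)]

for predicate, obj, popularity in ranked_triples("http://dbpedia.org/resource/Barack_Obama"):
    print(predicate, obj, popularity)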
5.1 Building a human association ground truth In order to acquire a human association ground truth one would usually present an item to a participant and ask her to enter all associations she has with this item in free text in a fixed amount of time (e.g., one minute). Then the next item would be shown. In order to make the collected data robust against priming effects, the order of the items usu- ally would be randomized. In order to turn this tedious process into a game many aspects of the ESP Game [von Ahn and Dabbish, 2004] can be borrowed. In contrast to the ESP Game the items of this game are not web images but Linked Data resources (i.e., their symbols). In its simplest form the envisioned game called Associa- tor is a symmetric two player output agreement web game, where a player starts a game and is randomly grouped with some other player for a short, fixed period of time (e.g., 2 minutes). Both players then play in rounds and can not communicate by other means than described in the following. At the start of each round they see an input item S (e.g., Barack Obama). What makes this game a Linked Data game is that this item is a symbol of a Linked Data resource R (i.e., there is a triple (R, RDFS:LABEL, S) in the LOD cloud (e.g. (DBPEDIA:BARACK OBAMA, RDFS:LABEL, Barack Obama)). Both players are then asked to enter what their partner will associate with this item. The round will last until any of the outputs of player p1 matches any of the outputs of player p2 or until both players decide to pass this item, as they can not agree on any association. All outputs and timings in this process are recorded. In case of a match both players get points de- pending on the time it took them to find the match and get into the next round. In case of a pass both players get into the next round without getting points. 258 What this game is going to return behind the scenes is a collection of associations between the input items (i.e., Linked Data resources’ symbols) and user entered sym- bols, which actually represent associated thoughts as we know. The matches recorded in this process are very good candidates for collectively agreed associations of the pre- sented item. The other non matching guesses as well pose associations, but are very susceptible to subjective associ- ations, cheating and the like. Still any recorded round can be used for single player games, for example when there is an unequal amount of players who want to play or when a player quits out of a running game. In this case the current player is not matched with another player who is playing at the same time, but one whose session was recorded ear- lier. This single player process can then validate additional guesses of the recorded session. In contrast to the ESP Game a couple of open questions remain trying to use the collected data to build a human association ground truth. First of all this simplest version of the envisioned game does not include taboo words. Taboo words are a list of words shown to the players in addition to the round’s item. Any word in the taboo list can not be used as a match to get into the next round. In the ESP Game taboo words are used to force the players to enter a larger variety of labels, by adding outputs that are agreed on for a certain amount of times to the corresponding input’s taboo list. 
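A minimal sketch of such a round, output agreement plus an ESP-Game-style taboo-list check, might look as follows. The scoring constants, the data structures and the example taboo words are assumptions for illustration, not a specification of the Associator.

# Hedged sketch of one Associator round: output agreement with a taboo list.
# Scoring constants, data structures and taboo words are illustrative only.
import time

TABOO = {"Barack Obama": {"president", "usa"}}   # hypothetical taboo list

def normalize(text):
    return text.strip().lower()

class Round:
    def __init__(self, item):
        self.item = item
        self.outputs = {1: set(), 2: set()}      # player id -> entered words
        self.started = time.time()

    def enter(self, player, word):
        """Record one guess; return (matched, points)."""
        word = normalize(word)
        if word in TABOO.get(self.item, set()):
            return False, 0                      # taboo words never match
        self.outputs[player].add(word)
        other = 2 if player == 1 else 1
        if word in self.outputs[other]:          # output agreement reached
            elapsed = time.time() - self.started
            points = max(10, int(100 - elapsed)) # faster match, more points
            return True, points
        return False, 0

r = Round("Barack Obama")
print(r.enter(1, "Honolulu"))   # (False, 0)
print(r.enter(2, "honolulu"))   # (True, <points>)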
The prob- lem with this approach is that taboo words certainly bias the association process in a way that not only the round’s input item primes associations, but also the taboo words (they sort of work as additional inputs). One of the ways to weaken this problem is randomly selecting taboo words out of an actually larger list of taboo words, as it is proposed for the ESP Game. Still it might be interesting to investigate if it is not possible to completely eliminate this kind of bias- ing, e.g., it is proposed here to try a kind of covered taboo list. In this case the players would only see a list of cov- ered words, indicating that there are taboo words. In case the player enters a word from this list, it blocks a potential match, gets revealed (for this player only) and the player is awarded with some small amount of additional points. Another open problem is related to the fact that all collected associations actually are symbols for associated thoughts. In order to analyze whether these associations are part of the LOD cloud already, the symbols need to be matched back to Linked Data resources, where possible. This process often is very ambiguous and it is not clear if this symbol to resource resolution can be seamlessly inte- grated into the Associator game without introducing new distortions. Also it should be investigated how to select the input items, which in the presented form are symbols of randomly selected Linked Data resources. 5.2 Rating Linked Data triples with association strengths While the previous game idea targets creating a human association ground truth, in this section an idea shall be presented how to turn the process of assigning association strengths to Linked Data triples into a game. A simple traditional approach for this task could show all Linked Data triples related to a previously selected Linked Data resource (e.g., all triples with dbpedia:Barack Obama as subject and/or object) to a test person and ask her to or- der these triples according to which one of the relations she would have thought of first, when hearing or reading Barack Obama. After ordering the list, the next list for an- other Linked Data resource is shown. Before presenting the idea to turn this process into a game, two things shall be mentioned: First, as showing a list of URI triples to the end-user is not of much use, the users will always see a symbolic representation of this list. Luckily Linked Data resources are usually labeled with a symbol with the rdfs:label datatype property. Second, the outcome of each of these experiments, which is a user centric absolute ranking, is not only highly subjective, but sometimes even unstable for one person, as a lot of relative decisions are involved within this process and the human brain tends to lose track of them. Hence it seems to be bet- ter to ask for these atomic relative comparisons of two facts and then use an objective ranking algorithm to generate an absolute ranking out of them. Generating an absolute rank- ing out of such results can be compared to chess ranking systems, where based on the outcomes of atomic competi- tions (player p1 won against p2), a global ranking is calcu- lated, just that in this case there are no players competing, but facts. The envisioned game is called BetterRelations and bor- rows many ideas from Matchin [Hacker and von Ahn, 2009], which asks users to compare images against each other and calculates a global ranking from this. 
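How such pairwise decisions could be aggregated into a global per-resource ranking can be sketched with an Elo-style update. The paper only draws the analogy to chess ranking systems, so the K-factor, the initial rating and the example decisions below are illustrative assumptions.

# Hedged sketch: derive a global ranking of facts from pairwise decisions of
# the form "fact a was preferred over fact b", using an Elo-style update.
K = 32
INITIAL = 1000.0

def expected(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def rank_facts(decisions):
    """decisions: iterable of (winning_fact, losing_fact) pairs."""
    ratings = {}
    for winner, loser in decisions:
        ra = ratings.setdefault(winner, INITIAL)
        rb = ratings.setdefault(loser, INITIAL)
        ea = expected(ra, rb)
        ratings[winner] = ra + K * (1.0 - ea)
        ratings[loser] = rb - K * (1.0 - ea)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

decisions = [("is president of the USA", "was born in Honolulu"),
             ("is president of the USA", "is married to Michelle Obama"),
             ("is married to Michelle Obama", "was born in Honolulu")]
print(rank_facts(decisions))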
In its simplest form BetterRelations is a symmetric two player decision agreement web game. A player starting the game is randomly matched with some other player ei- ther for a predefined amount of time (e.g., two minutes) or a predefined amount of rounds (e.g., 50). At the start of each round both players are presented with one input item, which actually is a Linked Data resource’s symbol (e.g., Barack Obama), and two facts about this item (e.g., is pre- sident of the USA and was born in Honolulu). Both play- ers are then asked to select the fact that their partner will have thought of first. In case both players agree they are rewarded with points and get into the next round. In case of disagreement, both players get into the next round but do not get points. As in Matchin the amount of points col- lected in a round rises with the number of decision agree- ments of both players in a row, punishing a disagreement without actually subtracting points from the user and pre- venting cheat strategies to gain many points by simply se- lecting a random fact as fast as possible. In order to prevent another obvious cheating strategy, the facts are presented to both players in randomized order. All actions in this pro- cess are recorded, and for example used for a single player mode. Behind the scenes this game acquires a large amount of relative decisions between facts. Decisions which both players agreed on are especially validated. Disagreements could mean a lot of things and could result from subjective rankings and cheat attempts, or could indicate that the two facts were equally important. In contrast to Matchin, BetterRelations will not create one globally ranked list for all Linked Data triples, but in- stead is going to create a list for each Linked Data resource of interest and all facts related to this resource. The ranking algorithm, which transforms the relative ratings into these global ratings hence has to deal with a lot smaller lists than in the case of Matchin. Also the algorithm shall be able to quickly exclude a large number of erroneous facts, as they occur in Linked Data, from being played again, in order not to bore play- ers with such facts. One possibility would be to provide 259 the players with an explicit button saying “both facts are nonsense”. Nevertheless, it has to be investigated how to include this third choice into the rewarding system without abandoning the effective cheating prevention mechanisms. Last but not least, BetterRelations is a game of a very exploratory nature. Often it could happen that the player does not recognize the round’s item and might want to read a few sentences, explaining what the resource is about. For many Linked Data resources such a short in- troduction exists in the form of a rdfs:comment or a dbpedia-owl:abstract, but it is unclear how this in- formation can be included in the game without interfering with the rating choices. One promising possibility could be to include such information on demand only and to treat the resulting rating in a different way on the server side. Also as it generally could take a few seconds for the user to switch context, it should be investigated if the through- put of the game can be raised by playing a few consecutive rounds about the same item, only changing the facts, with- out introducing priming effects. 6 Conclusion and Outlook This paper introduced and motivated the hypothesis that simulating human associations could improve text under- standing capabilities of machines. 
Thanks to Linked Data we have a very large and promising dataset at hand to sim- ulate human associations. Nevertheless, in order to investi- gate more thoroughly whether Linked Data really is a good dataset to simulate associations, first of all a reasonably large human association ground truth is needed. Also as human associations can be of different strengths, such as- sociations strengths would need to be annotated to many Linked Data triples. As generating both of these datasets would be infeasible with traditional approaches, for each of them an idea was presented how to turn the acquisition process into a “Game with a Purpose”. After presenting these ideas the next research steps in- clude implementing several versions of the games and properly evaluating them with respect to the desired data quality, the throughput, average lifetime play and expected contribution as mentioned in [von Ahn and Dabbish, 2008]. The resulting datasets are then to be used to investigate the overlap of human associations and Linked Data, to rate the extraction process of DBpedia and to benchmark sev- eral heuristics used to infer non-existent edge weights for Linked Data. Acknowledgements Special thanks to Benjamin Adrian for the many insightful conversations about this field of re- search. References [Baddeley et al., 2009] Alan Baddeley, Michael W. Eysenck, and Michael C. Anderson. Memory. Psychol- ogy Press, 2009. [Berners-Lee et al., 2001] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, 284(5):34–43, May 2001. [Collins and Loftus, 1975] Allan M. Collins and Eliza- beth F. Loftus. A Spreading-Activation Theory of Se- mantic Processing. Psychological Review, 82(6):407– 428, 1975. [Gerrig and Zimbardo, 2010] Richard J. Gerrig and Philip G. Zimbardo. Psychology and Life. Allyn & Bacon, Pearson, Boston, USA, 19th edition, 2010. [Hacker and von Ahn, 2009] Severin Hacker and Luis von Ahn. Matchin: Eliciting User Preferences with an On- line Game. In Proc. of the SIGCHI Conference on Hu- man Factors in Computing Systems, pages 1207–1216, Boston, MA, USA, 2009. ACM. [Halpin and Hayes, 2010] Harry Halpin and Patrick J. Hayes. When owl: sameAs isn’t the Same: An Anal- ysis of Identity Links on the Semantic Web. In Chris- tian Bizer, Tom Heath, Tim Berners-Lee, and Michael Hausenblas, editors, Proc. of the WWW2010 Workshop on Linked Data on the Web (LDOW), Raleigh, USA, 2010. CEUR Workshop Proceedings. [Siorpaes and Hepp, 2007] Katharina Siorpaes and Martin Hepp. OntoGame: Towards Overcoming the Incentive Bottleneck in Ontology Building. In Proc. of the 3rd In- ternational IFIP Workshop On Semantic Web & Web Se- mantics (SWWS), Vilamoura, Portugal, 2007. Springer, LNCS. [von Ahn and Dabbish, 2004] Luis von Ahn and Laura Dabbish. Labeling Images with a Computer Game. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326, Vienna, Austria, 2004. ACM New York, NY, USA. [von Ahn and Dabbish, 2008] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Commu- nications of the ACM, 51(8):58–67, August 2008. [von Ahn et al., 2006] Luis von Ahn, Mihir Kedia, and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. In Proc. of the SIGCHI Confer- ence on Human Factors in Computing Systems, pages 75–78, Montre´al, Que´bec, Canada, 2006. ACM, New York, NY, USA. 
260 Abstract This paper presents on a novel knowledge and experience management approach for the devel- opment of patient-individual formulations in pharmacies applying case-based reasoning to solve problems with formulations that are under construction. A detailed analysis of the tasks and problems of formulating mixtures is given. A so- phisticated domain model, a case base for prob- lem solving and some first services are described that support pharmacists in their daily formula- tion tasks. The evaluation results from a first field test are presented. A brief discussion of re- lated work and an outlook conclude the paper. 1. Introduction Pharmacies are not only dispensing medications that have been fabricated by the pharmaceutical industry. They are also compounding medications for individual patients. The development of a formulation for an efficient, safe, and stable mixture requires plenty of task-oriented, phar- maceutical knowledge. A huge amount of this knowledge is documented in heterogeneous information sources like pharmaceutical textbooks and electronic publications1. It is a difficult and time-consuming task to check all relevant properties of a new mixture. Even experienced pharmac- ists have to review the literature carefully for relevant information, as there is such a wide variety of prescrip- tions and prefabricated substances. In current pharmacy practice, a profitable creation of patient-individual formu- lations is hardly possible. Task-oriented, electronic assis- tance is urgently needed. This paper presents a knowledge and experience man- agement approach [Bergmann, 2002] that supports phar- macists in formulating drugs for individual patients by means of a pharmaceutical domain model and case-based reasoning. It focuses on modeling pharmaceutical know- ledge on substances, properties, and relationships of them, using the properties and relationships as check criteria in order to examine formulations for efficacy, safety, and stability, and for providing drug information, and employ- ing case-based reasoning for reusing cases that record solutions of hurt check criteria. 1 A prominent German sample for an electronic source of over 500 formulation-related references is the formularium NRF (‘Neues Rezeptur-Formularium’) of the ABDA (‘Bundesvereinigung Deutscher Apothekerverbände’), on- line available at http://www.pharmazeutische- zeitung.de/index.php?id=2264, Retrieved October 24, 2008. The paper is organized as follows: In Section 2, we in- troduce the tasks and problems of patient-individual for- mulation. Section 3 addresses our knowledge and expe- rience management approach in order to support the for- mulation tasks. Section 4 presents first evaluation results from a field test. Section 5 concludes the paper with a discussion and an outlook. 2. Development of a patient-individual for- mulation A pharmacist develops a patient-individual formulation usually on the basis of a medical prescription. The pre- scription form contains a list of quantified substances called prescription positions. Figure 1 shows a sample prescription with a prefabricated salve ‘Volon A’ in the first position and a basic skin cream ‘Basiscreme DAC’2 in the second position. The ‘aut idem’ label on the left hand side means that the substance may be substituted by a generic drug. Fig. 1: Sample prescription form handed in at a pharmacy. The pharmacist extracts the relevant information from the prescription in order to acquire an initial formulation. 
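As a simple illustration, the information carried by the sample prescription of Figure 1 can be captured in a small data structure. The field names below are assumptions made for illustration only and do not reflect the schema of the system described later in this paper; the 5 g amount of the first position follows from the 'ad 100 g' specification discussed in Section 2.1.

# Illustrative sketch of the information extracted from the sample
# prescription in Figure 1. Field names are assumptions, not the schema
# of the assistant system described later.
from dataclasses import dataclass

@dataclass
class Position:
    substance: str
    quantity: float
    unit: str = "g"          # a missing unit is interpreted as gram
    fill_up: bool = False    # True if the amount is an 'ad' specification
    aut_idem: bool = False   # substitution by a generic drug allowed

prescription = [
    Position("Volon A", 5.0),                          # 5 g, so 'ad 100 g' leaves 95 g of cream
    Position("Basiscreme DAC", 100.0, fill_up=True),   # 'ad 100 g'
]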
The initial formulation has to be checked carefully and to be adapted if necessary. With the availability of the final formulation, the expiration date and further drug informa- tion can be determined. 2 DAC stands for the German drug codex ’Deutscher Arz- neimittel Codex’, which specifies the components of some basic pharmaceutical substances. The DAC is part of ABDA’s NRF (see previous footnote). Reuse of Pharmaceutical Experience on Patient-individual Formulations Mirjam Minor Michael Raber University of Trier St. Barbara Apotheke D-54286, Trier, Germany D-54290 Trier, Germany minor@uni-trier.de apotheke-trier@arcor.de 261 2.1 Creating an initial formulation from a pre- scription The first step of drug formulation is to create the initial formulation from the prescription. Several prescription- related challenges have to be addressed concerning the substances to be processed: 1. Synonymous names for substances, 2. Heterogeneous units of measurement, 3. Prefabricated substances. ad 1: Substances can be named by a variety of syn- onymous names. For instance, the common dermatologi- cal agent ‘urea’ is often prescribed by its Latin form ‘car- bamidum’, or by ‘carbamide’, ‘carbonyl diamide’, or ‘carbonyl diamine’. Sometimes, also trade marks like ‘Aspirin’ instead of ‘acetylsalicyl acid’ are taken. A pharmacist is hardly able to know all the names for sub- stances that may be used by the physicians. In practice, she successively learns the preferences of particular phy- sicians only in addition to her own pharmaceutical lan- guage use. Novel names have to be looked up when they occur. ad 2: The second difficulty is the variety of measure- ments in which the substances can be denoted. Mass val- ues can be seen alongside volume and proportion specifi- cations.3 A missing unit means that the value is inter- preted in gram. The supplementary ‘ad’ stands for filling up the whole mixture with this substance until the speci- fied amount. For instance, the ‘ad 100 g’ in the above sample prescription specifies that the pharmacist should add an amount of 95 g of this basic cream. Frequently, the prescription positions are given in different units. That means that the units of measurement have to be trans- formed into unified, explicit specifications. ad 3: Positions with an explicated unit of measurement may still suffer from incomplete knowledge concerning the content. This is to be found if a compound substance is denoted as it is frequently the case with prefabricated substances. Sometimes, detailed information on the par- ticular ingredients is missing. ‘Volon A’, for instance, contains 0.01 g Triamcinolon per gram, i.e. 0.05 g in the whole sample mixture above. Furthermore, the vendor’s specification of ‘Volon A’ says only that it contains Po- lyethylen and Paraffin. The proportion of these two aux- iliary substances is not published by the vendor. The pharmacists have to estimate the amounts of subcompo- nents if necessary. The estimation can be feasible over multiple assembly levels. When the pharmacist has solved the prescription- related problems that we described above, the initial for- mulation is available. Before the mixture is actually made, the initial formulation has to be reviewed for efficacy, safety, and stability. 2.2 Validating a formulation and solving prob- lems The review of an initial formulation requires pharmaceuti- cal knowledge on the basic and prefabricated substances, 3 The ratio ‘% (m/m)’ stands for ‘mass per mass’, for in- stance ‘% (g/g)’ for ‘gram per gram’. 
‘% (m/V)’ is for ‘mass per volume’ like ‘% (g/ml)’. ‘% (V/V)’ units for ‘vo- lume per volume’ are quite common as well as mass pro- portions with respect to drops, pieces, or international units. their properties and relationships, as well as on the partic- ular patient, and use. Obviously, the concentration of active components is very important, but also potential problems with particular substances and mixtures have to be considered. The check criteria for validating a formula- tion can be assigned to the following three areas: I. Physical, chemical, and galenic characteristics of a certain substance or a group of substances, II. Physical, chemical, and galenic characteristics of a combination of certain components, III. Microbiological quality of a mixture. ad I: First, the pharmacist confirms that the particular substances are medically unobjectionable for a certain patient and use. Phenol, for instance, is not any more recommended to be applied on the skin. Furthermore, she checks whether the concentrations of the active compo- nents of the drug, i.e. of the medical and preservative agents, are within the mandatory range. Further characte- ristics to be validated are, for instance, the range of pH- values in which a substance should be formulated best or which impact a type of wrapping has on the stability of a particular substance. ad II: Second, the pharmacist clarifies that there are not any serious incompatibilities of components of the formu- lation. This concerns, for instance, the physical stability of two components in a mixture. The pharmacist knows whether an agent is soluble sufficiently in an auxiliary substance like olive oil or whether problems like sedimen- tation or caking are to be expected. ad III: The microbiological quality of a mixture is not only depending on the use of preservative substances. It also plays a role to avoid the generation of a nutrient me- dium within the mixture by a certain combination or treatment of substances. Sometimes, the decisions of the pharmacist rely on incomplete knowledge for the reason of incomplete specifications of substances that we have described above or for other reasons. If the pharmacist detects a problem during the valida- tion of a formulation it can be solved in different ways. Sometimes the pharmacist is able to mend the potential problem without changing the formulation, for instance with a soon expiration date or with adding the instruction that the drug has to be refrigerated. If a major adaptation of the formulation, not such as the addition of a preserva- tive agent, becomes necessary the prescribing physician has to be contacted in order to develop a solution con- jointly. A sample adaptation that requires the approval of the physician is depicted in Figure 2. Sometimes, the physician decides not to adapt the formulation, for in- stance for the reason of tradition or for intentionally pre- scribing a placebo that is effective psychologically only. 262 Uncommittedly composed formulation Acid. salicyl. plv. Triamcinolonacetonid Ol. oliv. 
5.0 0.1 ad 100.0 Problem  solubility of Salicyl acid in olive oil 2.5 % only; consequence: sedimentation, growth of crystals  Triamcinolonacetonid not solvable in olive oil, consequence: sedimentation, caking  restricted temperature exposure for vegetable oils  solubility by shaking the drug is not sufficient Alternative Salicyl acid Triamcinolonacetonid 2-Propanol Octyldodecanol 5.0 0.1 10.0 ad 100.0 Directions: solve Salicyl acid in Octyldodecanol while heating and separately Triamcinolonacetonid in 2- Propanol. Let both solutions cool down and mix them at room temperature. Fig. 2: Sample adaptation of a formulation [RLH03, own translation]. 3. Knowledge and experience management approach The analysis of the development of patient-individual formulations in Section 2 has shown that it would be worthwhile to have an assistant system for the manage- ment of the required pharmaceutical knowledge and expe- rience. In the following, we will describe the modeling and reuse of pharmaceutical knowledge in a knowledge and experience management system. The system supports the following tasks of formulation:  Acquisition of initial formulation,  Retrieval of relevant check criteria,  Review of check criteria,  Mending of hurt criteria (deactivate criterion, adapt formulation: re-calibrate, change list of po- sitions),  Generation of additional instructions for how to prepare and use a mixture. The degree of automatic support that is provided by the system varies from task to task: The acquisition of an initial formulation from a prescription form is done main- ly automatically but requires user interaction for the ac- quisition of missing information on ingredients and amounts within prefabricated substances (compare Sec- tion 2.1). The retrieval and review of the check criteria is performed automatically. It addresses the validation areas I – III that have been introduced in Section 2.2. The mending of hurt criteria is supported by a case-based approach. The generation of instructions has still to be done by the pharmacist. In order to realize the support capabilities mentioned above, the assistant system provides a set of services us- ing an underlying domain model. Section 3.1 will intro- duce this domain model. Section 3.2 will briefly sketch the services. 3.1 Domain model The domain model represents knowledge on patient- individual formulations in a task-oriented way. That means that it does not aim to describe general knowledge on substances like all chemical, physical and galenic properties of substances. Instead, it focuses only on those characteristics that are relevant for the tasks of formula- tion. The model consists of four main parts:  substances: a task-oriented taxonomy of sub- stances, their properties, and relationships,  formulations: data on prescriptions and formula- tions including the routes of application and the wrappings,  master data of patients, health professionals, and health insurances, including administrative as well as medical information like history data on a patient’s prescriptions,  system administration data like user roles. In the following, we will present some details on the substances as this part of the model has the most impor- tant impact on providing assistance for the formulation tasks. The core of the substances model is a taxonomy of sub- stances, groups of substances, and prefabricated sub- stances. Figure 3 depicts the section of this taxonomy for the prefabricated substance ‘Volon A’. 
It consists of three ingredients: the medical agent Triamcinolon with 0.01 % (g/g), Polyethylen, and Paraffin, which belongs to the group of Alkanes.
Fig. 3: Sample section of the substances model describing Volon A.
Synonym names are not depicted in Figure 3 due to space limitations. They are integrated as objects of the class Synonym that are related to the abstract term, for instance ‘Volon A::Synonym’ and ‘VolonA::Synonym’ related to ‘VOLON_A::Substance’. Check criteria are represented by objects of the class ‘CheckCriterion’ encapsulating terms in predicate logic. For instance, the term Phenolic(x) ∧ (Cyclohexene(y) ∨ Benzol(y)) stands for the following sentence describing an incompatibility: “Phenolic substances might react with Benzol or Cyclohexenes, producing unwanted phenolic by-products”. Atomic formulas with arity one (one predicate about one term) stand for
In Figure 2, the uncommitted composed formu- lation together with the four problem items form the prob- lem part of a case while the alternative positions and the directions describe the solution part. A case base consists of a set of such cases recording pharmaceutical expe- rience. If a new problem occurs that requires the adapta- tion of a formulation, the best matching case is retrieved from the case base in order to reuse its solution. The re- trieval is based on a standard similarity function [Berg- mann, 2002] computing a weighted sum of local similarity values based on an internal, structural representation of the problem part of the cases. When building the internal representation of the cases, the domain model is applied to create attribute-value pairs from the initial formulation positions. The representation is created by the same pre- processing algorithm that is applied during the processing of a new recipe for the acquisition of an initial formula- tion. The user decides after the retrieval of the best match- ing case whether the solution is applicable to the new problem and transfers it to the new formulation where appropriate. 4. State of implementation and evaluation The acquisition of initial formulation, the retrieval, and the review of check criteria is fully implemented. The mending of the hurt criteria by means of case-based rea- soning is ongoing work. A field test has been conducted with three German phar- macies to evaluate the already implemented parts of the approach. As all German pharmacies underlie the same accounting mechanism with the health insurance compa- nies and the major part of them makes use of the electron- ic sources described in Section 2, the evaluation results can be considered representative for Germany. The field test investigated two research questions: The first question is whether the domain model including the check criteria is appropriate for the pharmaceutical knowledge that is required for the task of creating patient-individual formu- lations. The second question is whether the implementa- tion is usable. The first question has been investigated by an expert review of the domain model. The second ques- tion has been investigated by one of the pharmacists in- volved by means of working with the implemented mod- eling tool. The results for the first question are quite promising: The model review has shown that it covers the pharmaceutical knowledge including the check criteria to a great extend. Only the potentially heterogeneous granu- larity of check criteria was considered to cause problems in future. The modeling activities concerning the second research question led to 61 substances and groups of sub- stances, 18 check criteria derived from 11 genuine reci- pes. It turned out that the tool worked in principle well but that the computational performance of the tool should be improved in future. 5. Discussion of related work and outlook In this paper, we have presented a knowledge and expe- rience management approach that supports pharmacists in elaborating patient-individual formulations. A domain model of substances including their properties and rela- tionships is used to check potentially problematic proper- ties and relationships for a certain mixture. A case-based approach provides assistance for the adaptation of formu- lations when check criteria have been hurt. The literature reports a case-based approach using deci- sion trees to guide tablet formulation [CWR98]. 
In con- trast to our work, this approach addresses the formulation for the industrial production of drugs. In our patient- individual approach, aspects like tablet weight and yield pressure do not play any role. Furthermore, there is a wider variety of prescriptions in pharmacies than in an industrial tablet production. We think that considering particular check criteria is more feasible for this applica- tion area than using complex decision trees also with respect to maintenance issues. In our future work, we aim at finishing the implementa- tion and evaluating the case-based support for the valida- tion and adaptation process. 264 Acknowledgments The authors acknowledge the contributions of Thomas Jodes, Christian Schlimbach, Stefan Schölzel, and Mi- chael Christmann who made valuable contributions to the project by preparing their diploma theses and implement- ing the prototype. References [Bergmann, 2002] Bergmann, R.: Experience Manage- ment. LNAI 2432, Springer, 2002. [Craw et al., 1998] Craw, S.; Wiratunga, N.; Rowe, R.: Case-based design for tablet formulation. In Advances in Case-Based Reasoning, Proceedings of the 4th Eu- ropean Workshop on Case-Based Reasoning, LNAI 1488, Springer, pp. 358-369, 1998. [Eifler-Bollen et al., 2003] Eifler-Bollen, R.; Lein, A.; Reimann, H.: Qualität von Rezepturen steuern. Re- trieved October 30, 2008 from the Web http://www.pharmazeutische- zeitung.de/fileadmin/pza/2003-46/titel.htm. [Richter, 1998] Richter, M. M.: Introduction. In (Lenz, M.; Bartsch-Spörl, B., Burkhard, H.-D., Wess, S., Eds.): Case-Based Reasoning Technology. LNAI 1400, Springer, pp. 1-15, 1998. 265 Resubmission: Taking OWL to Athens: Semantic Web Technology takes Ancient Greek History to Students Jochen Reutelshoefer1, Florian Lemmerich1, Joachim Baumeister1, Jorit Wintjes2, Lorenz Haas2 1. Institute of Computer Science, University of Wu¨rzburg, Germany {lastname}@informatik.uni-wuerzburg.de 2. Institute of Ancient History, University of Wu¨rzburg, Germany {firstname.lastname}@uni-wuerzburg.de Abstract The HermesWiki project is a semantic wiki ap- plication on Ancient Greek History. As an e- learning platform, it aims at providing students effective access to concise and reliable domain knowledge, that is especially important for exam preparation. In this paper, we show how se- mantic technologies introduce new methods of learning by supporting teachers in the creation of contents and students in the personalized iden- tification of required knowledge. Therefore, we give an overview of the project and characterize the semi-formalized content. Additionally, we present several use cases and describe the seman- tic web techniques that are used to support the application. Furthermore, we report on the user experiences regarding the usefulness and appli- cability of semantic technologies in this context. 1 Introduction Students today are used to collect a large amount of re- quired knowledge from the internet. While the number of available webpages is huge for almost any given topic, the quality of these pages is quite heterogeneous. Additionally, it is very difficult for students, especially undergraduate students, to extract the essential knowledge from extensive webpages. Therefore, students can significantly benefit from a reliable and concise knowledge base, that combines the advantages of traditional text books with the advantages of internet-based knowledge platforms, like accessibility, integration of multimedia resources, and the ability to up- date the contents easily. 
However, the creation of such a web application still is a tedious task, as texts and resources must be created, collected, and adapted to the context. Ad- ditionally, with the growing amount of content it again gets more difficult for students to find exactly the resources they require. In this paper, we describe how semantic technolo- gies help to alleviate both problems in an e-learning project in the domain of Ancient Greek History, the HermesWiki. The HermesWiki is a semantic wiki, developed to give students of history an introductory domain overview. As a full featured wiki system, HermesWiki allows for an effective collaborative composition of content. Further- more, semantic technologies provide advanced functional- ity that generates additional value to the learning environ- ment: Formalized paragraphs and semantic annotations of the main text enable the contributors to easily generate ad- ditional content tailored to different tasks, such as tables of events, maps, or a quiz. Further, students have access to different forms of knowledge like narrative elements, an- cient source texts, geo data, chronologic information and images. They can use semantic navigation to easily find the concepts they are looking for, filter the presented content by its relevancy on demand, and query the systems knowl- edge using an intuitive user interface. Figure 1 shows the essay about Tyrannis. It includes a short list of the most important events generated from the ontology, an introduc- tory section, that explains the context of this essay, a text description of the respective topic, an overview of ancient historical sources, and a formalized set of relevant events. Figure 1: The top (a) and bottom (b) of the wiki page containing the essay about Tyrannis taken from the Her- mesWiki. (Screenshots of the system, in German language) 2 Resubmission This is a resubmission of work that has originally been submitted to the 7th Extended Semantic Web Conference (ESWC2010) [Reutelshoefer et al., 2010]. References [Reutelshoefer et al., 2010] Jochen Reutelshoefer, Florian Lemmerich, Joachim Baumeister, Jorit Wintjes, and Lorenz Haas. Taking OWL to Athens – Semantic Web Technology takes Ancient Greek History to Students. In ESWC’10: Proceedings of the 7th Extended Semantic Web Conference. Springer, 2010. 267 INTEGRATION OF LINKED OPEN DATA IN CASE-BASED REASONING SYSTEMS Christian Severin Sauer1 Kerstin Bach2 Klaus-Dieter Althoff2 University of Hildesheim, Dep. of Computer Science Intelligent Information Systems Lab D-31141, Hildesheim, Germany 1christiansauer@gmail.com, 2{lastname}@iis.uni-hildesheim.de Abstract This paper discusses the opportunities of inte- grating Linked Open Data (LOD) resources into Case-Based Reasoning (CBR) systems. Upon the application domain travel medicine, we will exemplify how LOD can be used to fill three out of four knowledge containers a CBR system is based on. The paper also presents the applied techniques for the realization and demonstrates the performance gain of knowledge acquisition by the use of LOD. 1 Introduction The recent developments in the field of Linked Open Data (LOD) further added value to the already existing vast amount of structured information that is available through the use of LOD [Bizer et al., 2008]. The availability of such an amount of structured information suggests apply- ing LOD to the semi-automatic generation of knowledge. 
In the context of our work, knowledge represents the data that is organized in the knowledge containers of a Case-Based Reasoning (CBR) system. Each CBR system uses the four knowledge containers vocabulary, similarity measures, transformational knowledge, and cases. The work presented in this paper focuses on filling the vocabulary and similarity measure containers as well as generating cases from LOD. To this end, we demonstrate some possibilities of filling the aforementioned knowledge containers using simplified examples. These examples are preliminary and incomplete, and thus cannot be used in a real-world CBR system as they stand.
Linked data is provided for free and contains comprehensive knowledge in which each term is assigned to at least one category and is thus represented in a specific context. A recent development in the field of LOD is the emergence of complex knowledge repositories such as the DBpedia ontology1. DBpedia contains many terms originating from the on-line encyclopedia Wikipedia. These terms are organized in an ontology and are enriched with further information such as labels in various languages. Currently, the DBpedia ontology contains 1,478,000 instances; Table 1 lists the largest classes.
Knowledge acquisition is still the bottleneck in the development of CBR systems. Our approach aims at automation; for this purpose we propose a schema that extracts knowledge from LOD sources and transforms and provides it for CBR systems.
1 http://dbpedia.org/ontology/

class          instances
Place            413,000
Person           312,000
Work             320,000
Species          146,000
Organisation     140,000
Building          33,000
...                  ...
Diseases           4,600

Table 1: Instances of DBpedia

The remaining sections of this paper are structured as follows: In section 2 we describe the application domain travel medicine, especially the docQuery project in which the experiments are carried out, followed by section 3, which describes how knowledge can be integrated in CBR systems. Section 4 presents the implementation details of our approach. Section 5 presents experimental results of our approach, before we sum up the paper and give an outlook on future work in the final section.
2 Application Domain: Travel Medicine
Travel medicine is an interdisciplinary specialty concerned with the prevention, management and research of health problems associated with travel, and covers all medical aspects a traveler has to take care of before, during and after a journey. For that reason it covers many medical areas and combines them with further information about the destination, the activities planned and additional conditions which also have to be considered when giving medical advice to a traveler. Travel medicine starts when a person moves from one place to another by any mode of transportation and stops after returning home without diseases or infections. The realization of the travel medicine project docQuery is based on the SEASALT (Sharing Experience using an Agent-based System Architecture LayouT) architecture [Reichle et al., 2009], which is especially suited for the acquisition, handling and provision of experiential knowledge as it is provided by communities of practice and represented within Web 2.0 platforms. It is based on the Collaborative Multi-Expert-Systems (CoMES) approach presented by Althoff et al.
[Althoff et al., 2007], a continuation of combining established artificial intelligence techniques and the application of the product line concept (known from software engineering) creating knowledge lines. According to the SEASALT architecture, docQuery con- sists of eight different CBR systems. Each CBR system within docQuery equips a software agent that represents a 269 certain topic. Further the multi-agent-system is aggregat- ing a composed result. However, within this paper we only focus on one CBR systems that contains information about diseases that might somebody can get infected with during a journey. 3 CBR System CBR, the episodical knowledge, especially made experi- ences and successfully applied solutions for problems draw a huge potential for the development of an Experience Web [Plaza, 2008], because the experiences of many WWW users can be captured in case bases and provided in CBR systems. Because of the fact that CBR systems are based on experiences it is much easier to use them to handle ex- periences in comparison to systems that are based on tech- nologies from the areas of information retrieval or seman- tic web. Experiences in CBR systems occur in all types of knowledge: mostly in cases, but also similarity mea- sures, vocabulary and transformational knowledge can be derived. Huge amounts of raw data for the knowledge ex- traction is available on the web communities and the chal- lenge is making these experiences available. According to Richter [Richter, 1998], the knowledge of CBR systems can be provided in all four knowledge con- tainers that include vocabulary, similarity measures, trans- formational (or adaptation) knowledge and cases. We fo- cus on how the knowledge containers can be filled using the experiences provided in web communities. Therefore we have to extract the knowledge before it can be stored. Further on, we deal with data sources that are mostly free text what usually requires a symbolic representation of keywords. That is the reason why we currently focus on the extraction of taxonomies that can be used for both, enhancing the vocabulary and assigning the similarity. 4 Implementation Within this paper we present how knowledge about dis- eases can be extracted from LOD and integrated in a CBR system. Our CBR system has been developed using the open source tool myCBR [Stahl and Roth-Berghofer, 2008], thus all results produced by our application are com- patible with the myCBR case representation and thus can be applied in myCBR-based applications. In case of the generation of taxonomies that contains knowledge about the similarity between objects represented in the taxonomy, we use the Knowledge Extraction Workbench (KEWo) to further refine the initially provided data from LOD. A more detailed description of KEWo can be found in [Bach et al., 2010b] and [Sauer, 2010]. As the first step we retrieved data about diseases from the DBpedia ontology. The result of such a retrieval can be seen in figure 1. In this case we queried for all available diseases in the ontology and their German labels. The fig- ure shows the shortened result containing the URI and the according label. Furthermore, we queried the DBpedia ontology in order to extend our information retrieval by using the Resource Description Framework Schema (RDFS) to retrieve labels of diseases in different languages. Also, we retrieved the Simple Knowledge Organization System based (SKOS) in- formation about the categories a disease belongs to, e.g. 
that a "bite" is a member of the category injuries. Analogous to the retrieval of the German labels of the different diseases, we also retrieved the corresponding German labels of these categories. The technique used for these initial retrieval steps was a set of SPARQL queries executed with the open source desktop SPARQL query tool "Twinkle"2.
2 http://www.ldodds.com/projects/twinkle/
Listing 1 shows an example query that retrieves all diseases from the DBpedia ontology together with the categories they are part of, filtered so that only the German labels of the diseases and the categories are returned.

Listing 1: Example Query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?disease ?categories ?dislabels ?catlabels
WHERE {
  ?disease a <http://dbpedia.org/ontology/Disease> .
  ?disease rdfs:label ?dislabels .
  ?disease skos:subject ?categories .
  FILTER ( lang(?dislabels) = "de" )
  ?categories rdfs:label ?catlabels .
  FILTER ( lang(?catlabels) = "de" )
}

This query, still simple compared to the many sophisticated possibilities SPARQL offers for retrieving complex information about objects and their relations, yielded 2,004 category-disease pairs with their corresponding German labels. Although some of the items in the disease ontology are of a broader nature, like the item Abuse, we found enough specific data to try to incorporate the retrieved data into our CBR system.
The simplest task was the extraction of data for the knowledge container vocabulary. For this task we simply derived the labels of the diseases from the retrieved data and thus formed a basic vocabulary of diseases for our CBR system. This vocabulary contains 2,000 diseases with their German labels.
The next knowledge container we wanted to fill with data from the LOD retrieval were the structured cases of our structured CBR system. For this purpose we chose, as attributes, seven categories from the English category-disease pairs that we assumed to be of interest for a case. The attributes describing a disease case are the following categories: Viral disease, Infectious disease, Bacterial disease, Injury, Foodborne illness, Inflammation, and Parasitic disease. Table 2 describes the structure of a disease case.

Name                 label of the disease
Viral disease        true or false
Infectious disease   true or false
Bacterial disease    true or false
Injury               true or false
Foodborne illness    true or false
Inflammation         true or false
Parasitic disease    true or false
Table 2: Disease categories (derived from DBpedia)

The cases created from LOD in the application scenario of this paper cannot, of course, serve as full cases in a real-life application domain (like docQuery). However, LOD provides the potential data to create such cases.
To derive such cases from the retrieved data we implemented a Java-based tool that extracts the disease and category labels and checks for each disease label whether it occurs in a category-disease pair whose category is one of the categories we chose as attributes of our disease cases. After this extraction and comparison, the resulting data was transformed into CSV format so that it could be imported as cases by the myCBR plugin used in our Protégé development environment3. The created CSV file was then imported into Protégé, resulting in the generation of 612 disease cases.
3 http://protege.stanford.edu
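The extraction and comparison tool is described here only at a high level. The following is a minimal sketch of what such a tool could look like, under the assumption that the query results have been exported as a tab-separated file with one disease-category pair per line; the file names, the column layout and the semicolon-separated CSV output are illustrative assumptions and not the authors' actual implementation.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

/**
 * Minimal sketch: turn (disease label, category label) pairs exported from a
 * SPARQL query (cf. Listing 1) into a CSV file of structured disease cases
 * for import via the myCBR plugin. File names, the TSV layout and the exact
 * category strings are assumptions for illustration only.
 */
public class DiseaseCaseExporter {

    // The seven category attributes chosen in Section 4 (cf. Table 2).
    private static final List<String> ATTRIBUTES = Arrays.asList(
            "Viral disease", "Infectious disease", "Bacterial disease",
            "Injury", "Foodborne illness", "Inflammation", "Parasitic disease");

    public static void main(String[] args) throws IOException {
        // disease label -> set of category labels it appears with
        Map<String, Set<String>> categoriesPerDisease = new TreeMap<>();

        // Assumed input: one "disease<TAB>category" pair per line.
        for (String line : Files.readAllLines(Paths.get("category-disease-pairs.tsv"),
                                              StandardCharsets.UTF_8)) {
            String[] fields = line.split("\t");
            if (fields.length < 2) continue;                  // skip malformed lines
            categoriesPerDisease
                    .computeIfAbsent(fields[0].trim(), k -> new HashSet<>())
                    .add(fields[1].trim());
        }

        // Write one case per disease: the name plus one boolean column per category.
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get("disease-cases.csv"),
                                        StandardCharsets.UTF_8))) {
            out.println("Name;" + String.join(";", ATTRIBUTES));
            for (Map.Entry<String, Set<String>> entry : categoriesPerDisease.entrySet()) {
                StringBuilder row = new StringBuilder(entry.getKey());
                for (String attribute : ATTRIBUTES) {
                    row.append(';').append(entry.getValue().contains(attribute));
                }
                out.println(row);
            }
        }
    }
}

The generated CSV file would then be imported into the Protégé/myCBR environment as described above.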
The third knowledge container we aimed at was the similarity measure. In our current project docQuery [Bach et al., 2010a] we use taxonomies of items, e.g. diseases, locations or medicaments, to encode the similarity of two items as their distance within the taxonomy [Bergmann, 2002]. Our goal was therefore to construct the corresponding taxonomy from the retrieved LOD. This taxonomy, representing the similarity of diseases, was again built using a small Java-based tool we implemented and one of our recently developed tools, the Knowledge Extraction Workbench (KEWo). The process of taxonomy generation was the following: We extracted the diseases and the categories to which they belong, as described above, by deriving this information from the raw retrieval data with a simple Java tool. This process was again carried out on the German data set we retrieved. Due to the techniques the KEWo uses for taxonomy generation, the resulting data was further formatted in a special way: the category is listed twice, followed by the disease. A line describing a category-disease pair thus looks as follows: Mykose (category) – Mykose (category) – Kokzidioidomykose (disease). This special formatting is owed to the numerical approach employed by the KEWo's analysis methods (a minimal code sketch of this reformatting step is given below). More details on the analysis, with a brief discussion, can be found in [Bach et al., 2010b].
Our tool KEWo was created as a knowledge extraction tool operating on an online forum of travel medicine experts. Its goal is, in short, to derive medical knowledge from this forum by applying various NLP and information extraction techniques to the forum content. For further details regarding the exact process model and principle of operation of the KEWo, please see [Bach et al., 2010b].
As a workaround to employ the KEWo for taxonomy generation from the retrieved LOD, the generated text containing all category-disease pairs was inserted as a posting into the expert forum connected to the KEWo. Thus the KEWo was able to process the text containing the information from the LOD and build a taxonomy of diseases.
We took the first 1,000 category-disease pairs and processed them in the described way, receiving a taxonomy describing 116 diseases. Figure 2 shows a snippet of the generated taxonomy. Such a taxonomy can be used in a CBR system both as a source of vocabulary and for assigning a similarity value between two concepts.
Figure 2: Snippet of the generated taxonomy
For the generation of the taxonomy we used the retrieved data about diseases and their categories and conducted a statistical analysis of the occurrences of the different diseases and categories. In a later development it is, of course, desirable to integrate ontological data available in LOD directly into a similarity measure. At this point, however, our approach is of a more basic nature and intended to point out the possibility of simple data integration from LOD into a similarity measure.
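The reformatting step for the KEWo input mentioned above is simple enough to be shown as code. The following is a minimal sketch, again assuming the category-disease pairs are available as a tab-separated file; it only reproduces the "category - category - disease" line format described in the text and not the KEWo analysis itself, and all file names are illustrative assumptions.

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Minimal sketch of the KEWo input formatting: each category-disease pair is
 * written as "Category - Category - Disease", i.e. the category repeated
 * twice followed by the disease, as required by the numerical analysis of
 * the KEWo. Input/output file names are assumptions for illustration.
 */
public class KewoInputFormatter {
    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get("kewo-posting.txt"),
                                        StandardCharsets.UTF_8))) {
            for (String line : Files.readAllLines(Paths.get("category-disease-pairs.tsv"),
                                                  StandardCharsets.UTF_8)) {
                String[] fields = line.split("\t");      // disease <TAB> category
                if (fields.length < 2) continue;
                String disease = fields[0].trim();
                String category = fields[1].trim();
                out.println(category + " - " + category + " - " + disease);
            }
        }
    }
}

The resulting text file corresponds to the posting that was inserted into the expert forum connected to the KEWo.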
An obvious question is why there are not more diseases in the taxonomy, given that we used 1,000 category-disease pairs. The comparatively low number of diseases showing up in the generated taxonomy is partly caused by a certain 'misuse' of our own tool KEWo, which is optimized for analyzing natural language rather than the highly structured text given to it in this experiment; please see Section 5 for further details on this conflict. Nevertheless, we were able to produce a taxonomy of diseases from the retrieved LOD, and in the upper ranges of its structure the linking of the disease items is of satisfying quality.

5 Experimental Results
The goal of our experiments was to retrieve LOD about diseases and to incorporate the retrieved data into three of the four knowledge containers of a CBR system. These three knowledge containers were the vocabulary, the cases and the similarity measure of a CBR system used in our project docQuery, which aims at online knowledge provision in the field of travel medicine.
Our SPARQL query to the DBpedia ontology returned 2,004 unique English disease labels. Further queries returned 2,000 German disease labels, 2,000 English category-disease pairs, in which a category-disease pair represents a disease label and a SKOS category it is assigned to, e.g. AIDS : viral diseases, and also 2,004 German category-disease pairs.
For the first knowledge container, the vocabulary, we were able to extract all of the English and German disease labels, resulting in an English and a German disease vocabulary built from the 2,004 English and 2,000 German disease terms, respectively. We conclude that, the small initial amount of time invested in developing the Java tool for extracting terms from the raw retrieved LOD aside, the time needed to build a vocabulary is drastically reduced, literally to minutes. Once such a tool is available, it is a matter of minutes to adapt it, run a new query, and extract the resulting terms from the retrieved LOD for any new vocabulary.
For the second knowledge container we were able to extract 612 cases from the retrieved English data. Each case consists of a unique disease name and seven disease categories to which it either belongs or does not belong; please see Section 4 for the exact structure of the extracted cases. In the process of generating these cases we again used a small Java tool we developed to pre-process the raw retrieved LOD. While using this tool we found that it is very easy to adapt it to generate any desired kind of CSV-formatted case file and thus, after importing these files via myCBR, to generate any kind of structured case from LOD according to the desired case structure. Similar to the rapid construction of new vocabularies, building structured cases from LOD is also a very rapid approach. Given that a basic extraction tool for the retrieved LOD is already in place, the generation of new cases consists of just a new query to an LOD repository, a slight adjustment of the extraction tool and a new import of the generated CSV file via myCBR. If the cases are not overly complex, this process can be performed in minutes and yields hundreds of cases, limited in number only by the amount of available LOD. Furthermore, using LOD also reduces the costs of knowledge acquisition in terms of time, logistics, effort, money, etc.
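The queries in our experiments were issued manually with the Twinkle desktop tool. Purely as an illustration of the "new query to an LOD repository" step, the following sketch shows how a query like the one in Listing 1 could also be executed programmatically against the public DBpedia SPARQL endpoint using Apache Jena; Jena, the endpoint URL and the output handling are assumptions for illustration and were not part of the setup described in this paper.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

/**
 * Minimal sketch: execute the disease/category query of Listing 1 against the
 * public DBpedia SPARQL endpoint with Apache Jena (an assumption; the paper
 * used the Twinkle desktop tool instead) and print the German label pairs.
 */
public class DiseaseQuery {
    public static void main(String[] args) {
        String queryString =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
            "SELECT ?disease ?categories ?dislabels ?catlabels WHERE { " +
            "  ?disease a <http://dbpedia.org/ontology/Disease> . " +
            "  ?disease rdfs:label ?dislabels . " +
            "  ?disease skos:subject ?categories . " +
            "  FILTER ( lang(?dislabels) = \"de\" ) " +
            "  ?categories rdfs:label ?catlabels . " +
            "  FILTER ( lang(?catlabels) = \"de\" ) }";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qexec =
                 QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.next();
                // One tab-separated disease/category pair per line, e.g. as
                // input for the extraction tools sketched above.
                System.out.println(solution.getLiteral("dislabels").getString()
                        + "\t" + solution.getLiteral("catlabels").getString());
            }
        }
    }
}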
For the third knowledge container, the similarity measure, we were able to build a taxonomy that carries information about the similarity of its items in the form of their distance within the taxonomy. The generated taxonomy contains 116 disease items and was built from 1,000 category-disease pairs as input. The top-level structures of the generated taxonomy show a satisfying quality; in the deeper levels, however, the quality of the disease item links, e.g. whether they make sense as parent-child pairs of nodes, deteriorates quickly. The number of disease items in the taxonomy is also not satisfying. We assume this to be a result of the workaround we employed to generate the taxonomy. We used our recently developed tool KEWo, which is specialized in natural language processing and information extraction from unstructured text, typically discussion posts from an online forum. Feeding this tool with the highly structured text derived from the LOD retrieval surely conflicted with the original goal of the KEWo, namely extracting information from unstructured text. We took this into account as far as we were aware of this 'misuse' and accepted the poorer results, thus obtaining at least a proof-of-concept taxonomy which shows that it is indeed possible to derive knowledge for the similarity measure from LOD. Although the process of generating a taxonomy from LOD is a bit more complex than the other two processes described in this paper, it is still by far faster than building a taxonomy manually. Despite the lacking quality of the item links in the deeper levels of the taxonomy, this first proof-of-concept run shows that the generation of a similarity measure in the form of a taxonomy is accelerated by the use of LOD just as much as the generation of the vocabulary and the cases.

6 Discussion
During our work with LOD we found it somewhat hard to identify relevant LOD repositories and the specific names of the attributes and predicates of the items in these repositories. Although we noticed ongoing efforts to improve the searchability of LOD, this is still an issue that hampers the use of LOD, despite its ease of use and resourcefulness once a suitable repository and all of its facets have been identified. Nevertheless, the rich and fast-growing sources of LOD surely justify further research into their use as a source of knowledge for the knowledge containers of CBR, especially given their highly structured nature and the fact that LOD is available for free.
As far as our first experimental results allow us to comment, once a working development environment is in place, consisting of querying tools, tools for refining raw LOD and tools such as myCBR for incorporating the refined data into CBR knowledge containers, filling the knowledge containers vocabulary, structured cases and similarity measure from LOD is an easy and fast process.
The fourth knowledge container of CBR, which contains adaptation knowledge, was not part of our experiments, and it is still questionable whether there is a method to generate and/or extract knowledge for this container from LOD. We assume that there might be a chance to use the taxonomy we generated as a similarity measure to derive some adaptation knowledge from it, similar to the model-based adaptation approach presented in [Hanft et al., 2010].
Furthermore, one can imagine that the structuring of much of the LOD as ontologies might yield further sources for the extraction of adaptation knowledge.

7 Summary and Outlook
In this paper we proposed the idea of using LOD as a source for filling three of the four knowledge containers a CBR system is based on. The containers we addressed were the vocabulary, structured cases and the similarity measure. We described our experimental setup: the methods used to acquire LOD in the field of diseases and the process of refining the acquired data, which involved the development of some simple Java-based tools as well as the use of more complex tools, namely our own KEWo and the external tools myCBR and Protégé, to transform the LOD so that it fits the structural requirements of the aforementioned knowledge containers [Bergmann, 2002].
We were able to produce results of good quantity and quality for the knowledge containers vocabulary and structured cases, and the data extracted for assigning similarity measures in the form of a taxonomy, built from the retrieved LOD with our own tool KEWo, is also acceptable. Furthermore, we were able to show that the semi-automatic acquisition of knowledge for the three mentioned knowledge containers from LOD results in a very rapid build-up of these containers compared to manual or even other semi-automatic approaches we recently developed (see [Bach et al., 2010b] for details).
Future goals based on the work presented in this paper are the further refinement and automation of the extraction process from LOD and the development of more efficient and more easily customizable tools for refining retrieved raw LOD. The question of how to extract adaptation knowledge should also be resolved in order to profit from rapid knowledge acquisition for all four knowledge containers.
A more distant goal is to incorporate all process steps and tools presented in this paper into a single tool that handles the whole process: finding appropriate LOD resources, querying them, refining the retrieved data and formatting it for further processing by our own tools, like the KEWo, or external tools like myCBR. All of these functionalities should be embedded in a convenient GUI that allows the user to specify the knowledge container for which LOD should be retrieved and the desired characteristics of this LOD, remaining customizable with regard to the many facets the contents of each knowledge container can have.

References
[Althoff et al., 2007] Klaus-Dieter Althoff, Kerstin Bach, Jan-Oliver Deutsch, Alexandre Hanft, Jens Mänz, Thomas Müller, Regis Newo, Meike Reichle, Martin Schaaf, and Karl-Heinz Weis. Collaborative Multi-Expert-Systems – Realizing Knowledge-Product-Lines with Case Factories and Distributed Learning Systems. In Joachim Baumeister and Dietmar Seipel, editors, Workshop Proceedings of the 3rd Workshop on Knowledge Engineering and Software Engineering (KESE 2007), Osnabrück, September 2007.
[Bach et al., 2010a] Kerstin Bach, Meike Reichle, and Klaus-Dieter Althoff. Case-Based Reasoning in a Travel Medicine Application. In Isabelle Bichindaritz and Lakhmi Jain, editors, Computational Intelligence in Medicine, Advanced Information and Knowledge Processing, page to appear. Springer, 2010.
[Bach et al., 2010b] Kerstin Bach, Christian Severin Sauer, and Klaus-Dieter Althoff.
Deriving Case Base Vocabulary from Web Community Data. In Cindy Marling, editor, ICCBR-2010 Workshop Proceedings: Workshop on Reasonng From Experiences On The Web, page to appear, 2010. [Bergmann, 2002] Ralph Bergmann. Experience Man- agement: Foundations, Development Methodology, and Internet-Based Applications, volume 2432 of Lecture Notes in Computer Science. Springer, 2002. [Bizer et al., 2008] Christian Bizer, Tom Heath, Kingsley Idehen, and Tim Berners-Lee. Linked data on the web (LDOW2008). In WWW ’08: Proceeding of the 17th in- ternational conference on World Wide Web, pages 1265– 1266, New York, NY, USA, 2008. ACM. [Hanft et al., 2010] Alexandre Hanft, Re´gis Newo, Kerstin Bach, Norman Ihle, and Klaus-Dieter Althoff. CookIIS - A successful Recipe Advisor and Menu Advisor. In Stefania Montani and Lakhmi Jain, editors, Successful Case-based Reasoning applications. Springer, 2010. [Plaza, 2008] Enric Plaza. Semantics and Experience in the Future Web. In Klaus-Dieter Althoff, Ralph Bergmann, Mirjam Minor, and Alexandre Hanft, edi- tors, Advances in Case-Based Reasoning, 9th European Conference, ECCBR 2008, Trier, Germany, September 1-4, 2008. Proceedings, volume 5239 of Lecture Notes in Computer Science, pages 44–58. Springer, 2008. [Reichle et al., 2009] Meike Reichle, Kerstin Bach, and Klaus-Dieter Althoff. The SEASALT Architecture and its Realization within the docQuery Project. In Ba¨rbel Mertsching, editor, Proceedings of the 32nd Annual Conference on Artificial Intelligence (KI-2009), Lecture Notes in Informatics, pages 556–563, September 2009. [Richter, 1998] Michael M. Richter. Introduction. In Mario Lenz, Brigitte Bartsch-Spo¨rl, Hans-Dieter Burkhard, and Stefan Wess, editors, Case-Based Rea- soning Technology – From Foundations to Applications, LNAI 1400. Springer-Verlag, Berlin, 1998. [Sauer, 2010] Christian Severin Sauer. Analyse von Web- communities und Extraktion von Wissen aus Commu- nitydaten fu¨r Case-Based Reasoning Systeme. Mas- ter’s thesis, Institute of Computer Science, University of Hildesheim, 2010. [Stahl and Roth-Berghofer, 2008] Armin Stahl and Thomas R. Roth-Berghofer. Rapid Prototyping of CBR Applications with the Open Source Tool myCBR. In ECCBR’08: Proceedings of the 9th European confer- ence on Advances in Case-Based Reasoning, pages 615–629, Berlin, Heidelberg, 2008. Springer-Verlag. 273 Abstract Die Herausforderung, sich in der beständig zu- nehmenden Information zurechtzufinden, ist nicht neu. Durch verschiedene Applikationen - insbesondere des Web 2.0 - sind Lösungen ent- standen, die einen Benutzer dabei unterstützen, einen Teil der für ihn relevanten Information zu verwalten und somit Wissen zu generieren. Al- lerdings bleiben die jeweiligen Informationen auf die jeweilige Applikation beschränkt. Mit Memex360 - einem Ergebnis des Use Cases ORDO im Forschungsprogramm THESEUS - wird eine integrative Schicht auf die Informatio- nen implementiert, wobei die jeweiligen Werk- zeuge weiter genutzt werden können. In dem Beitrag wird die Idee und Architektur von Memex360 vorgestellt. 1. Einleitung Die Herausforderung, sich in der beständig zunehmenden Menge an Information zurechtzufinden, ist nicht neu. Bereits 1945 veröffentlichte Vanevar Bush die Idee des MEMEX [Bush, 1945]. Dieser MEMEX (memory extender) unterstützt einen Wissensarbeiter dabei, die für ihn relevante Information zu erfassen, zu annotieren und miteinander in Bezug zu setzen. 
Grundidee dabei war es, den Menschen bei der Verwaltung und Erschließung sei- nes Wissens zu unterstützen – nicht den Menschen zu ersetzen. Eine Vielzahl von Web 2.0 Applikationen ermöglicht es heute, Informationen zu erfassen, zu annotieren und mit einander in Bezug zu setzen stehen. Blogs erlauben es, persönliche Notizen zu erfassen. Soziale Netzwerke hel- fen dabei, passende Ansprechpartner zu finden. Und nicht zuletzt Suchmaschinen – sei es auf dem Desktop oder im Web – sind in der Lage, umfangreiche Datenbestände schnell zu erschließen. Allerdings stehen diese Lösungen in der Regel für sich allein. Erschließt sich ein Nutzer ein neues Wissensgebiet, kann er zum Beispiel eine Notiz zu einem Dokument in einem Blogbeitrag anlegen, oder in einer Annotation in dem jeweiligen Dokument oder der Notiz-Funktion des Bookmark-Dienstes. Welche Form der Annotation genutzt wird, hängt von den Vorlieben des Benutzers und seiner Situation ab – die Notiz bei seinem bevorzugtem Book- mark-Dienst wird z.B. unterwegs benutzt, während die Notiz in einem Dokument genutzt wird, wenn er keine Internet-Verbindung hat. Diese Annotationen stehen also nur in der jeweiligen Applikation zur Verfügung – die jeweilige Information bleibt also isoliert. Doch nicht nur die Informationen in den Dokumenten bleiben isoliert. Auch das Wissensmodell, mit denen sich ein Benutzer Dokumente erschließt und klassifiziert, lie- gen in der Regel nur in der jeweiligen Applikation vor. Die in einem Personal Information-Management (PIM)- Werkzeug wie Outlook angelegten Kategorien liegen nur hier vor – um damit Blog-Einträge oder Bookmarks zu kategorisieren, müssten die Kategorien in der jeweiligen Applikation manuell gepflegt werden. Aufgrund des damit verbundenen hohen Aufwandes unterbleibt dabei meistens ein Abgleich der Kategorien. Zusätzlich hat ein Nutzer in den verschiedenen Appli- kationen Informationen, welche die Erschließung von Dokumenten weiter unterstützen können. Beispielsweise können die Kontaktdaten im PIM oder in sozialen Netz- werken genutzt werden, um Firmennamen und Personen in relevanten Dokumenten zu identifizieren. Die technische Grundlage, um diese Informationen mit- tels einer integrativen Schicht miteinander in Bezug zu setzen, liegt in den technischen Standards, mit denen die Inhalte in den jeweiligen Applikationen angesprochen werden können. Im Bereich der Blogs haben sich zum Beispiel die Blogger API [Winer, 2002] und die Meta- Weblog API [Anonymous] etabliert, um sowohl Inhalte und Kategorien abzufragen als auch zu erstellen. Durch RSS [Winer, 2003] Benachrichtigungen existiert ein Me- chanismus, um Änderungen in einer Applikation an ande- re Applikationen oder den Benutzer zu kommunizieren. So können aktuelle Änderungen zeitnah von Suchdiensten in einer maschinell verarbeitbaren Form erfasst werden. Memex360 implementiert diese integrative Schicht, welche die von einem Benutzer verwendeten Quellen integriert, miteinander in Bezug setzt und die Ergebnisse an den Benutzer meldet. 2. Memex360 – Idee und Architektur Der Ausgangspunkt von Memex360 ist, dass der Benutzer seine bisher verwendeten Werkzeuge zum persönlichen Wissensmanagement weiterhin verwenden kann. Memex360 sorgt für die entsprechende Integration der unterschiedlichen Applikationen. Die entsprechende Ar- chitektur ist in der folgenden Abbildung dargestellt. Die obere Schicht (Mashups und Browser Integration) deckt mögliche Werkzeuge zum persönlichen Wissensmanage- ment ab. 
Von links nach rechts sind dies: • PIM-Werkzeuge und RSS-Feed-Aggregatoren, um dem Benutzer beim Selbstmanagement zu unter- stützen und über Neuigkeiten auf dem laufenden zu halten. Memex360 – Persönliches Wissensmanagement mit Theseus ORDO Björn Decker Ralph Traphöner Attensity Europe GmbH Europaallee 10, 67657 Kaiserslautern, Germany bjoern.decker@attensity.com ralph.traphoener@attensity.com 275 • Bookmarks-Verwaltungs-Werkzeuge, um Referen- zen auf relevante Dokumente zu erfassen. • Blogging oder Wikis Tools, um persönliche Noti- zen zu erfassen • Visualisierungs-Tools, z. B. Mindmaps, um Infor- mationen miteinander in Bezug zu setzen. Mittels standardisierter Zugriffs-Mechanismen werden zum Einen die Inhalte dieser Applikationen durch die Service-Plattform erfasst und aufbereitet. Zum Anderen werden die Inhalte – wie ähnliche Dokumente oder Klas- sifikationen - in der Service-Plattform den verschiedenen Applikationen zur Verfügung gestellt. Die Service- Plattform hat dabei die folgenden vier grundlegenden Dienste: • Der Such-Service indiziert zum Einen die relevan- ten Dokumente und reichert sie mit Metadaten an. Zum Anderen stellt dieser Service eine föderierte Suche zur Verfügung, um weitere Informations- quellen – durch Wolken angedeutet - zu integrie- ren. Neu hinzugekommene Informationen werden mittels konfigurierbarer RSS Feeds an den Nutzer weitergeleitet. • Der Referenz-Dienst verwaltet die Referenzen des Benutzers auf relevante Dokumente. Weiterhin lei- tet der Referenz-Dienst neue Referenzen an den Such-Service zur weiteren Bearbeitung weiter. • Der Zettelkasten-Dienst erfasst persönliche Noti- zen und setzt sie mit den Dokumenten aus dem Re- ferenz-Dienst in Bezug. • Den Ontologie-Dienst, um das persönliche Wis- sensmodell des Benutzers zu pflegen und den an- deren Diensten zur Verfügung zu stellen. Die On- tologie wird entweder durch Hinzunahme von Metadaten in den Applikationen (z.B. neue Kate- gorien) oder dem Visualisierungs-Werkzeug ge- pflegt. Vorschläge für Erweiterungen der Ontolo- gie können dabei durch die Analyse des Such- Services erstellt werden. 3. Die Erstellung einer Marktanalyse – Ein Beispiel für die Nutzung von Memex360 Herr X ist mit der Erstellung einer Marktanalyse zum Thema „Capsaicin“ beauftragt – einem Bestandteil des Chillis, welcher z.B. in der Schmerztherapie genutzt wird. Dazu nutzt er die Mindmap, welche er zu einem ähnlichen Thema bereits angelegt hatte. Ausgangspunkt sind die Fragen der bereits durchgeführten Marktanalyse zu Defi- nitionen, ähnlichen Wirkstoffen, Studien und Anbieter. Die entsprechenden bereits angelegten Kategorien und Klassen in der Ontologie kann er wiederverwenden. Die neuen Einträge werden darüber hinaus mit der aktuellen Aufgabenstellung als Arbeitskontext verschlagwortet. Herr X sucht nun nach entsprechenden Begriffen – und richtet gleichzeitig entsprechende Benachrichtigungen auf dem Such-Service ein. So wird er später über Änderungen informiert. Relevante Treffer merkt er sich im Referenz- dienst und fertig eigene Notizen an. Via RSS stehen diese Notizen und Referenzen auch im Mindmap Tool zur Ver- fügung und können hier mit anderen Ergebnissen seiner Abbildung 1: Memex360 Architekturskizze 276 Recherche in Bezug gesetzt werden. Bei der Marktübersicht erstellt er unter anderem einen Überblick über Forschungsinstitute und Herstellerfirmen – jeweils von Herrn X angelegten Unterklassen von Anbie- ter. 
Vorschläge hierzu werden aufgrund der linguistischen Analyse der gefundenen Objekte – oder bereits vorhande- nen Metadaten – unterbreitet. Bestandteil der o.g. Klassen ist dabei eine Referenz auf eine Instanz der Klasse Ort. Auch hierfür können für den Anbieter entsprechende Vorschläge generiert werden. Die Ortsangabe wiederum kann genutzt werden, um über Open Linked Data (z. B. geonames.org) weitere Informa- tionen zu ermitteln. So können die Geokoordinaten hinzu- gefügt werden, um die räumliche Verteilung der Anbieter zu analysieren. Während der Recherche verfeinert Herr X auch an an- deren Stellen sein Wissensmodell. Er findet z. B. immer wiederkehrende Personen, welche er als wichtige Autoren erkennt. Diese kann er z.B. in seinem Adressbuch ablegen und so in auch in die Ontologie einfügen. Auch richtet er neue Unterkategorien ein. Diese Erweiterungen können dann genutzt werden, um mit den bestehenden Anfragen weitere relevante Dokumente zu suchen und zu identifi- zieren. Durch die zunehmende Anreicherung der Mindmap können darüber hinaus weitere, ähnliche Einträge in ande- ren Mindmaps identifiziert und zur Integration in die aktuelle Mindmap vorgeschlagen werden. 4. Prototypische Implementierung Die Implementierung des Memex360 Prototyps hat den Status eines Proof of Concept. Dabei wird der Suchdienst durch eine CBR1 Alle Informationsobjekte, seien es Webseiten oder ei- gene Notizen und Dokumente, werden als strukturierte Fälle repräsentiert. Die erforderlichen Metadaten werden durch die Verschlagwortung des Anwenders und Ontolo- gie basierte Informationsextraktion aus dem Inhalt des Informationsobjekts gewonnen. Letzteres erfolgt mit einer auf linguistischen Regeln basierenden Text Mining Engi- ne (TME), die auch im Rahmen von THESEUS ORDO entwickelt wird. Komponente zur Verfügung gestellt, d. h. es handelt sich im Kern um eine strukturierte fallbasier- te Anwendung. Das Schema – die Ontologie – der strukturierten CBR Engine kann domänenspezifisch modelliert werden. Dabei ist es weitgehend unerheblich welcher Formalismus ver- wendet wird. Weitgehend in dem Sinne, das die genutzte IAS objektorientierte Modelle nutzt, die eine extensionale Klassendefinition voraussetzen. Eine bijektive Abbildung in OWL ist nicht möglich, da nicht alle OWL Konstrukte unterstützt werden. SKOS und RDFS Modelle sind abbildbar und Modelle können importiert werden. Modellierung durch den Anwender bezieht sich dabei immer auf Instanzen. Schema, d. h. Klassen, können nicht zur Laufzeit bearbeitet werden. Bewusst wurde Memex360 so ausgelegt, dass die Be- nutzung nicht an dedizierte Clients gebunden ist. Viel- mehr bettet sich Memex360 nahezu unsichtbar in die bestehende Arbeitsumgebung ein: • Referenzen und Bookmarks können mit allen gän- gigen Tools verwaltet werden, soweit diese ihre 1 Case-Based Reasoning Inhalte als RSS Feeds zur Verfügung stellen. Die Autoren nutzen Delicious2 und Zotero3 • Neue Bookmarks werden als solche erkannt und automatisch durch Crawler oder RSS Abonnement zur Fallbasis hinzugefügt. . • Die direkte Erstellung eigener Inhalte und Notizen kann mit Blogwerkzeugen erfolgen. Hierzu wurde Memex360 um die gängigen Blog API Protokolle erweitert. Damit wird z. B. Windows Live Writer zu einem komfortablen Editor für die eigene Fall- basis. • Die Verschlagwortung von Inhalten erfolgt jeweils durch die nativen Tagging-Mechanismen der vor- genannten Werkzeuge. Wird dabei in z.B. 
Delicious ein neues Tag vergeben, so wird es in die Ontologie aufgenommen, steht dann im Live Wri- ter als Schlagwort zur Verfügung und kann auch dort in den Kategoriebaum eingeordnet werden, der wiederum von der CBR-Engine als Ähnlich- keitsmaß interpretiert wird. • Die Suche nach Inhalten erfolgt mit Hilfe des Open Search Interface direkt aus der Suchbox des Brow- sers. Dort wird Memex360 genauso als Suchma- schine zur Verfügung gestellt wie Bing oder Goog- le. Die Treffer zu einer Suche werden als RSS Feed4 • Für die im Marktanalyse Szenario genannte Mindmap Funktionalität wurde die Visual Unders- tanding Environment (VUE) zurückgeliefert, der auch die strukturierten Falldaten enthält. Dies ermöglicht es insbesondere auch, eine Suche als Abonnement zu einem Stan- dard RSS Feed Reader hinzuzufügen und damit bei neuen Treffern zur gespeicherten Suche auf dem Laufenden zu bleiben. 5 genutzt. VUE verar- beitet RSS Feeds und erlaubt das strukturieren und verbinden von Inhalten und Metadaten per Drag und Drop aus einem RSS Objekt. Beinhaltet ein Fall im Trefferfeed z. B. eine neue erkannte Per- son, so kann diese als Konzept auf die Arbeitsflä- che gelegt und anschließend mit der Klasse Perso- nen verbunden werden. Noch zu realisieren ist ein Rückkanal von VUE in die Ontologie des Memex360, um modellrelevante Änderungen au- tomatisch zu übernehmen. Alternativ wird zurzeit der Einsatz von Wise Mapping6 Eine weitere wünschenswerte Erweiterung ist das Abonnement von Quellen für Instanzen aus der sogenann- ten Open Linked Data Cloud. Aktuell erfordert dies noch einen expliziten Import aus RDF Dateien oder einem Triplestore. als Web basiertes Modellierungs- und grafisches Recherchewerkzeug geprüft. 2 http://www.delicious.com 3 http://www.zotero.org 4 RSS als Protokoll stellt kein Sicherheitsrisiko dar, solange das Memex360 Backend im Unternehmensnetzwerk betrie- ben und per VPN zugegriffen wird 5 http://vue.tufts.edu/ 6 http://www.wisemapping.org/ 277 Neben der Benutzung über Werkzeuge mit Schnittstel- len mit Web 2.0 Standards verfügt Memex360 über eine REST API im JSON Datenformat. Auf diese Schnittstelle können – in Ergänzung zur Benutzung durch die Stan- dardwerkzeuge dedizierte Benutzerschnittstellen aufge- setzt werden. Mit dem JSON/REST API wurde z. B. eine dedizierte Suchschnittstelle für einen Demonstrator in der Life Science Domäne implementiert. Abbildung 2 zeigt das Interface, welches mit JQuery realisiert wurde. 5. Vergleichbare Ansätze Mit diesen Funktionalitäten hat Memex360 das persön- liche Wissensmanagement im Fokus. Aus Platzgründen werden im Folgenden zwei ausgewählte ähnlich gelagerte Ansätze präsentiert, welche aus Sicht der Autoren die stärkste Ähnlichkeit zum Memex360 Konzept aufweisen: Semantische Wikis und Semantische Desktops. Durch semantische Wikis (z.B. [Oren et al 2006]) kön- nen persönliche Notizen als Wiki-Seiten – und damit mit einem eigenen GUI - mit Metadaten angereichert und durch Links miteinander in Bezug gesetzt werden. Exter- ne Informationsobjekte können verlinkt werden. Damit kann ein semantischer Wiki Funktionen der zuvor genann- ten Standard-Applikationen übernehmen (z.B. Bookmarks in einer Liste strukturieren). Semantische Desktops (z.B. [Sauermann, 2005]) stellen wiederum eine Infrastruktur und entsprechende GUI-Komponenten zur Verfügung, um Informationsobjekte mit semantischen Annotationen (Metadaten und Relationen) miteinander in Bezug zu setzen. Im Fokus dieser Auszeichnung sind dabei die auf dem Desktop verfügbaren Informationen. 
Beide Applika- tionsklassen enthalten dabei sowohl relevante Daten als auch Informationen über das Wissensmodell des Benut- zers. Aus Sicht von Memex360 können diese Applikati- onsklassen als weitere Werkzeuge fungieren, um den Wissensraum des Benutzers zu erschließen. Sie stellen damit weitere Komponenten im Informationsökosystem dar, die aber nicht Memex360 als Monitor des Systems ersetzen. 6. Zusammenfassung und Ausblick Memex360 integriert die unterschiedlichsten Informati- onsquellen und baut dabei auch Brücken zwischen diesen Quellen. Damit wird durch Memex360 der Dispersion der persönlichen Information begegnet. Die vorgestellten Ideen und Konzepte werden im Rah- men des THESEUS ORDO Projektes weiterentwickelt. Danksagung Das diesem Bericht zugrundeliegende Vorhaben wurde mit Mitteln des Bundesministeriums für Wirtschaft und Technologie unter dem Förderkennzeichen 01MQ07005 gefördert. Die Verantwortung für den Inhalt dieser Veröf- fentlichung liegt bei den Autoren. Literatur [Anonymous] Anonymous. Blogger APIs - Google Code. http://code.google.com/apis/blogger/. [Bush, 1945] Bush, Vanevar. 1945. As We May Think - Magazine - The Atlantic. The Atlantic (Juli): 47-61. http://www.theatlantic.com/magazine/archive/1969/12/ as-we-may-think/3881/. [LOD] Linked Data Org. Linked Data | Linked Data - Connect Distributed Data across the Web. http://linkeddata.org/. [Winer, 2002] Winer, Dave. 2002. RFC: MetaWeblog API. März 14. http://www.xmlrpc.com/metaWeblogApi. [Winer, 2003] Winer, Dave. 2003. RSS 2.0 Specification (RSS 2.0 at Harvard Law). Juli 15. http://cyber.law.harvard.edu/rss/rss.html. [Oren et al 2006] Oren, Eyal et al. 2006. Semantic Wikis for Personal Knowledge Management (Database and Expert Systems Applications, LNCS). Springer Berlin [Sauermann, 2005] Sauermann, Leo. 2005. The semantic desktop-a basis for personal knowledge management (Proceedings of the I-KNOW). Abbildung 2: Memex360 - Suche nach Capsaicin mit JavaScript Interface 278 ABIS 2010 - 18th Intl. Workshop on Personalization and Recommendation on the Web and Beyond Melanie Hartmann Technische Universita¨t Darmstadt Darmstadt, Germany melanie@tk.informatik.tu-darmstadt.de Eelco Herder Daniel Krause L3S Research Center Hannover, Germany {herder, krause}@L3S.de Andreas Nauerz IBM Research and Development Bo¨blingen, Germany andreas.nauerz@de.ibm.com 1 The ABIS Workshop Series ABIS 2010 is an international workshop, organized by the German SIG on Adaptivity and User Modeling1 of the Gesellschaft fr Informatik. ABIS takes place in connection with LWA2010, together with related workshops on infor- mation retrieval, machine learning and information man- agement. 2 About ABIS 2010 Personalization has become a core feature on the Web and beyond: Google provides personalized search results. Amazon recommends books and other products. Facebook suggests friends and groups. Personalized features and rec- ommendations include items that were appreciated by sim- ilar users or the users friends and are typically based on a users profile data, the users current location or items that the user browsed, searched, tagged or bought earlier. Mash- ups and cross-application linking of user profiles promise to provide even more relevant suggestions and services. Personalization is great. But personalization can go aw- fully wrong, too. Systems may draw wrong conclusions about your search actions and constantly annoy you with personalized menus that do not work or recommendations for books that you couldnt care less for. 
And do you really want your friends and colleagues to know what products you searched for yesterday? Our keynote speaker will be Ju¨rgen Geck, CEO of Open- Xchange2, a company that provides integrated tools for mail and collaboration to millions of users in Europe. 3 Workshop Overview The program commitee received submissions from re- search and industry within the broad area of User Model- ing and Adaptive Systems. Special emphasis of this year’s workshop was on submissions in the following areas: • Obtaining user data: logging tools, aggregation of data from social networks and other Web 2.0 services, location tracking • Modeling the user data: collaborative filtering, cross- application issues, contextualization and disambigua- tion, use of ontologies and folksonomies • Personalization and recommendation: applications in social networks, search, online stores, mobile comput- ing, e-learning and mash-ups 1http://abis.l3s.de 2http://www.open-xchange.com/ • Evaluation and user studies: laboratory studies, em- pirical studies and analysis of existing corpora of us- age data • Emerging and important issues: future applications, new paradigms in human-computer interactions, pri- vacy awareness 4 Program Chairs • Melanie Hartmann (Technische Universita¨t Darm- stadt, Germany) • Eelco Herder (L3S Research Center, Germany) • Daniel Krause (L3S Research Center, Germany) • Andreas Nauerz (IBM Research and Development, Germany) 5 Program Committee • Fabian Abel (Leibniz University Hannover, Germany) • Maria Bielikova (Slovak University of Technology, Slovakia) • Susan Bull (University of Birmingham, UK) • Betsy van Dijk (University of Twente, Netherlands) • Birgitta Knig-Ries (Universita¨t Jena, Germany) • Peter Dolog (Aalborg University, Denmark) • Eelco Herder (L3S Research Center, Germany) • Sabine Graf (Athabasca University, Canada) • Melanie Hartmann (Technische Universita¨t Darm- stadt, Germany) • Dominikus Heckmann, DFKI Saarbru¨cken, Germany) • Nicola Henze (Leibniz University Hannover, Ger- many) • Vera Hollink (Centrum Wiskunde & Informatica, The Netherlands) • Daniel Krause (L3S Research Center, Germany) • Tsvi Kuflik (University of Haifa, Israel) • Erwin Leonardi (Universiti Tunku Abdul Rahman, Malaysia) • Andreas Nauerz (IBM Research and Development, Germany) • Alexandros Paramythis (Johannes-Kepler-University, Linz, Austria) • Wolfgang Reinhardt (Universita¨t Paderborn, Ger- many) • Sven Schwarz (DFKI, Germany) • Marcus Specht (Open University of the Netherlands, Netherlands) • Stephan Weibelzahl (National College of Ireland) 279 Abstract The work presented in this paper summarizes the research performed in order to implement a set of Units of Learning (UoLs) focused on adaptive learning processes, using the specification IMS Learning Design (IMS-LD). Through the imple- mentation and analysis of four learning scenar- ios, and one additional application case, we iden- tify a number of constraints on the use of IMS- LD to support adaptive learning. Indeed, our work in this paper shows how IMS-LD expresses adaptation. In addition, our research presents a number of elements and features that should be improved and-or modified to achieve a better support of adaptation for learning processes. Fur- thermore, we point out to interoperability and au- thoring issues too. Finally, we use the work car- ried out to suggest extensions and modifications of IMS-LD with the final aim of better support- ing the implementation of adaptive learning processes. 
1 A brief description of the IMS Learning Design IMS Learning Design (or simply IMS-LD) [IMS, 2003] is aimed to transform regular lesson plans into interoperable Units of Learning (UoL). This specification is able to use any pedagogical model to get a UoL run-able and editable in an interoperable way. IMS-LD augments other well- known e-learning specifications aforementioned, like SCORM, IMS Content Packaging, IMS Question and Test Interoperability or IMS Simple Sequencing. Furthermore, IMS-LD provides a language to describe the teaching and learning process in a Unit of Learning. It describes among other things the roles, the activities, the basic information structure, the communication among different roles and users; and all these under the pedagogical approach de- cided by the teacher and-or the learning designer. In this section, we show what is IMS-LD and how it is struc- tured, as well as how it provides Adaptation within the UoLs IMS-LD is able to describe a full learning flow with sev- eral elements -such as roles, activities, environments or resources- and features -such as properties, conditions, monitoring services or notifications [Burgos & Griffiths, 2005; Koper & Tattersall, 2005]. The usual life-cycle starts with a lesson plan modelled according to the IMS-LD specification, defining roles, learning activities, services and several other elements, inside an XML document called Manifest. An information package written in IMS Content Packaging [IMSCP, 2001] is used as a container for the resources and links them with the IMS-LD structure. Later, the Manifest is packaged with the nested resources in a compressed ZIP file, meaning a UoL. Several examples available are shown later on. IMS Learning Design uses the metaphor of a theatrical play to visualize how to model Units of Learning. A play is performed by a number of actors, who may take up a number of roles at different times in the play. Similarly in learning design a learner can take up different roles at different stages of a learning process. At the end of each act the action stops, all the learners are synchronised, and then a something new can begin. IMS-LD consists of three levels: Level A, with the defini- tion of the method, plays, acts, roles, role-parts, learning activities, support activities and environments. It is the core of the specification, contains the description of the elements that configure IMS LD and the coordination between them. For instance, role-parts define what activi- ties must be taken by a role in order to complete an act and, subsequently, a play. Level B, adds properties, conditions, calculations, moni- toring services and global elements to Level A, and pro- vides specific means to create more complex structures and learning experiences. Properties can be used as vari- ables, local or global ones, storing and retrieving informa- tion for a single user, a group or even for all the characters involved. Through these mechanisms the learning flow can be changed at the run time, as decisions can be made taking into account dynamic content. Logically it is the used level to express the most of the pedagogical needs concerning Adaptation, personalization, feedback, track- ing and several other usual requests of teachers and learn- ing designers. Finally, Level C adds notifications to Level B, meaning an email sent and a show/hide command to a specific activ- ity, depending on the completion of another one [Koper & Burgos, 2005]. What is wrong with the IMS Learning Design specification? 
Constraints And Rec- ommendations Daniel Burgos1, 2 1Atos Origin. Research & Innovation Albarracin 25, Madrid 28027, Spain daniel.burgos@ atosresearch.eu www.atosresearch.eu 2International University of La Rioja Gran Via Rey Juan Carlos I 41, Logroño, La Rioja 26002, Spain www.unir.net 281 2 IMS-LD and Adaptation In addition to the basic structure of Level A, the elements in Level B and Level C are actually the key for more ex- pressive UoLs (for instance, based on Adaptation or Col- laboration), as they combine several features that encour- age and make the content and the learning flow more flexible [Koper & Burgos, 2005]. Furthermore, the com- bination of these elements allows for the modelling of several classical adaptive methods (i.e. reuse of pedagogi- cal patterns, adaptability, navigational guidance, collabo- rative learning, contextualized and mobile distributed learning, Adaptation to stereotypes), making use of differ- ent structural elements of IMS-LD, like i.e. Environment, Content, User groups and Learning flow [Burgos et al., 2007]. In a literature study, we identify eight different kinds of Adaptation being carried out in eLearning systems [Bur- gos, 2008]: Interface based, Learning flow based, Content based, Interactive problem solving support, Adaptive in- formation filtering, Adaptive user grouping, Adaptive evaluation, and Changes on-the-fly. All of them use vari- ous inputs provided during the learning process and aim to tune the activities and actions of the learner to get the best learning experience as possible [Butz et al., 2003]. A wide and consistent set of rules of dependencies among users, methods and learning objects is needed to describe these eight types of Adaptation, and moreover their possible combinations. If we categorize all these types of Adapta- tion, we can group them in two clusters [Ahmad et al., 2004; Chin, 2001; De Bra et al., 2004; Baeza-Yates & Ribeiro-Nieto, 1999; Van Rosmalen & Boticario, 2005; Merceron & Yacef, 2003; Romero et al., 2003]. The first one consists of three types of Adaptation: 1. Interface-based (also called adaptive navigation and related to usability and adaptability) where elements and options of the interface are positioned on the screen and their properties are defined (color, size, shadow, etc.); this is closely related to general customization and supporting people with special needs which influence personaliza- tion, such as colour impairment or poor hearing, for in- stance. 2. Learning flow-based, where the learning process is dy- namically adapted to sequence the contents of the course in different ways. The learning path is dynamic and per- sonalised for every student, but even also for every time that the course is started (also called run or instance), so that the student can take a different itinerary depending on his performance. 3. Content-based, where resources and activities dynami- cally change their actual current content, as in Adaptive and Intelligent Web-Based Educational Systems based on adaptive presentation [Brusilovsky & Miller, 2001]. For instance, the information inside a learning activity can be classified in three levels of depth, and every level is shown based on a number of factors. The first cluster with three types of Adaptation becomes the base for the next one. Additional kinds of Adaptation feed a second cluster: 4) Interactive problem solving sup- port; 5) Adaptive information filtering, 6) Adaptive user grouping; 7) Adaptive evaluation; and 8) Changes on-the- fly. 
3 Methodology of analysis
This section describes how we carried out the analysis and the methodology followed for the research in this paper. Previously, we described how adaptation is envisaged by IMS-LD and which types of adaptation can be expressed with this specification. Furthermore, we have described, modelled and implemented a number of learning scenarios which exhibit features of adaptive learning processes.
First, we defined, modelled and analysed five Units of Learning (UoLs), which are described as learning scenarios (Table 1). In these learning scenarios, we describe adaptive learning processes and features. Further, we analysed a real application case from the ATOS University, in which a Unit of Learning (UoL) with adaptation features modelled in IMS-LD was implemented (Figure 1).

ID  Type of adaptation             Description
1   Adaptive Assessment            adaptation based on the learner's performance and knowledge
2   Adaptive Authoring             adaptation based on the learning designer's method
3   Adaptive Content               adaptation based on the learner's decision
4   Adaptive Mentoring             adaptation based on the teacher's decision
5   Combination of adaptive types  application case on corporate training
Table 1. Learning scenarios

Last, every learning scenario is analysed, and we report on shortcomings and recommendations to improve the expressiveness of IMS Learning Design towards a better adaptation process.
Figure 1. ATOS application case
Our analysis focuses on the main challenges and limitations of performing adaptive learning with IMS-LD. These mainly concern the need to improve the flexibility and interoperability of this specification when modelling adaptation. All UoLs are available at the GRAPPLE project website1.
1 www.grapple-project.org
The activity structure in an IMS-LD UoL is defined using plays, acts, activity structures, learning activities, support activities and environments. We can also use the property of visibility to hide and show these elements and to adapt the learning flow. In these cases the property works as a flag, switching on and off the ele- ments referred to. 4.3 Adaptation based on the content Content Adaptation is based on the information inside an activity that is shown and handled. We know that a learning flow is mainly focused on the sequence of the activities in a Unit of Learning. However, content based Adaptation is focused on the information of every activity, and on the activity itself. There are two main approaches for content based Adaptation in IMS-LD: Flag properties and content of properties. Flag properties hide and show elements like e.g. activities or environments. On the other side, the content of specific properties can be modified on the run, making use of global elements in the specifica- tion. 4.4 Elements in Levels B and C to model Adap- tation The elements in Level B and Level C providing support to Adaptation in Units of Learning are categorized as a) properties, b) conditions, c) global elements, d) calcula- 2 www.coppercore.org tions, e) monitoring services, and f) notifications [Koper & Burgos, 2005; Burgos & Specht, 2006]: 1. Definition, set-up and use of properties: Properties are taken as variables to store values. There are several types of properties: local, local-personal, local-role, global-personal, global. There is also a property-group that is able to compile a number of the others. 2. Conditions: IMS-LD is able to define a basic struc- ture if-then-else, or multiple structure with several chained basic if-then-else in a row, for instance to change the value of a property or to show and hide one element. 3. Global elements: Global elements provide a commu- nication flow between the imsmanifest.xml, where the different levels of IMS-LD are set-up, and other XML files. Mainly, they can get an input from the user and they can show a value of a property. Furthermore, they can manage DIV layers in XHTML, for instance to show and hide specific content. 4. Calculations: IMS-LD is able to make some basic arithmetic’s (sum, subtraction, multiplication and divi- sion) and some combination of a number of them in a row, to get a more complex formula, like a simple aver- age, for instance. 5. Monitoring service: The specification allows moni- toring any kind of property assigned to a user or a role, for instance. In order to start this action, firstly the component monitor must be set-up inside an environment and later the property can also be monitored. 6. Notifications: An action is automatically launched de- pending on the state of a property or a previous action, i.e., when a student ends an assignment an email is sent to the tutor. 5 Identification of constraints, gaps and is- sues to cope with We use every learning scenario aforementioned as a base to find restrictions, drawbacks and elements to im- prove within the specification. These resources show how far IMS-LD supports adaptation, when different inputs and roles are involved. We also make links to the integra- tion of UoLs, when needed. 
Out of the modelling and de- velopment of those UoLs we perform an analysis on which features, elements and components are missing or could be modified in order to achieve a more adaptive- and expressive-oriented general definition, with the ulti- mate aim of improving the specification and bringing it closer to actual needs on eLearning. In this section, we provide a detailed analysis of what IMS-LD can and cannot model, in its current information model, with regards to adaptation. This analysis concen- trates on the weak points and main features of every learn- ing scenario. These remarks will be addressed to produce a set of recommendations (i.e. extensions and modifica- tions) to improve the pedagogical expressiveness on IMS- LD, focused on adaptation, in the next section. Following, we summarize our main findings. With re- gards to the specification itself: 1. The definition of properties and the link through sev- eral working XML files is too complicated to become useful 2. The relation between layers and actions is not straightforward and it has to be done interlacing files, through global elements and XML 283 3. The lack of a richer conditional structure makes the editing of the set of rules more complicated on paper than they actually are from a rational point of view 4. Controlled iterations in the activities are not allowed. Furthermore, a closed activity cannot be re-initialized and/or go backwards 5. The monitoring service doesn´t cover any kind of user grouping. Therefore, a user (e.g. either a teacher or a learner) cannot follow the performance of several other users at the same time 6. Questions and answers are not personalised for user; they are identical for all users with the same role 7. The communication between teacher and student is little and indirect. They can view the values of properties but there is no other communication service between them 8. There is a lack of flexibility in the input point of changing the itineraries. In the type Sequence, the learn- ing activity with the question appears always at the same place. In the type Selection, the question is always pre- sented after 2 completed learning activities. In case the learning designer/teacher wants to shift this input point, they cannot do so 9. There is no possibility to handle absolute time to start the course and/or a specific activity. Only relative time to the precise time when the instance is created out of the UoL, it is possible 10. There is no chance to make a connection to an al- ready existing database (for instance, to make a query or to import already enrolled students or teachers). The data type of connection is not supported. Therefore, every en- rolment has to be done by hand or running a specific tool for that 11. Furthermore, any connection with the external world is impossible. For instance, a real-time effective communication between an LMS and an IMS-LD UoL is not possible so far, so that in fact they cannot benefit each other from mutual services and resources. There is no foreseen dispatcher or service in the specification allow- ing such connection [Moreno et al., 2007] 12. When an executable module is developed with other technologies (Macromedia Flash and PHP, for in- stance), it cannot be integrated with IMS-LD in any way. Therefore, we also identified an interoperability problem. 
Although IMS-LD is not developed with the intention of supporting such interactivity with users, it could allow for a valid integration with external resources using a communication layer/dispatcher.
13. A file uploaded from the hard disk of a computer is stored in a file-type property inside the internal database of the engine (CopperCore, www.coppercore.org, in this case). There is no possibility to change the default configuration for storing or retrieving resources. There is no facility to manage those uploads either. Although this is an issue concerning tools too, the core documents of IMS-LD do not provide this information and/or service either
14. IMS-LD does not allow saving information into external files or retrieving information from any external source
15. Performing a dynamic user selection in order to create groups is not possible. The teacher can monitor each user and provide him/her with some feedback on a personal basis. We could set up a property to be handled by groups, but these groups would have to be established before the actual start. However, if the teacher wants to create a group of students dynamically depending on their answers, this is not possible so far. To this extent, groups and roles are the same thing
16. IMS-LD does not allow for recording the user's behaviour; in fact, no measures (e.g., Total Time Needed, Time Before First Move) can be stored or retrieved
17. As a consequence, adaptation based on the user's behaviour cannot be developed using the IMS-LD specification. Furthermore, the current state of tooling does not support it either
In addition, with regard to the current engines, we highlight a few issues that would support a more powerful use of the specification:
18. Changes on-the-fly are not possible. In case the teacher or the learning designer wants to change, e.g., the questions, the answers, or the content of the next activity to be carried out, they find that every single resource has to be packed at design and publishing time, before the actual running of the instance
19. In questionnaires and other forms with fields, the teacher/learning designer cannot modify the number of questions or answers once the UoL has started
20. There is no option to run the UoL (the whole UoL or a part, such as a Learning Activity) twice within the same instance. Once a Learning Activity is closed, the user can read it again but the associated learning flow cannot be executed. For instance, after the question to change the itinerary is made in the historic route, there is no way to go back
21. There is no flexibility to change the content. When the teacher/learning designer wants to keep the same method and the same structure, but he/she wants to change one single HTML page with some content, the UoL has to be validated and published again. In this case, the learner and the teacher would have to be enrolled again and the learning process starts from the very beginning
22. Users cannot be dynamically enrolled within the UoL once it has started, and they have to be managed by an external tool
6 Further analysis
In the next section we show specific recommendations which deal with extensions and modifications of modelling structures, elements and components, as well as with the architecture of IMS-LD. Those recommendations are based on the constraints pointed out in Section 5.
However, there is a need to present some further analysis which can bridge both sections, from the constraints to the recommendations, since this in-between step is crucial to understand the rationale. We have organized the analysis as follows:
a) Analysis of general-purpose modelling. These elements will be used as part of others specifically implemented in learning processes, like personalisation. Furthermore, they become a basic set to be re-purposed in different contexts and for different goals. Therefore, this initial analysis comprises adaptive learning. A few very specific processes cannot be approached with just general structures; they need purpose-built elements which address specific personalisation goals
b) Analysis of the integration of Units of Learning and of a bi-directional communication with other external resources, systems and standards. When needed, we highlight the need for a means of communication (e.g., a communication layer), although its development is outside the scope of this research. We are focused on the specification itself and on how to improve the pedagogical expressiveness, and not on building any ad hoc technical artefact to reach this aim.
Out of this analysis, we conclude that specific recommendations should be categorized in three groups: a) Modelling, which compiles every single extension, modification or addition, general or specific, to the specification and the information model; and b) Architecture, which deals with functional requirements of the spec, with a focus on the interoperability, communication and integration of IMS-LD with other external means. In both cases, we look for the highest performance along with the minimal structural change. Furthermore, we respect the original specification as much as possible and try to make as few changes as possible; on the other hand, all of them are needed to build the suggested solution and to cope with the overall approach. In addition, c) we reflect some recommendations about the authoring tools. Although these are not the responsibility of the specification, they are indeed related to IMS-LD, since the tools which allow the end users to create useful and applicable Units of Learning can make the process easier or more difficult, and therefore they constrain the actual use and outcomes.
Furthermore, we present our conclusions within the same two main blocks that we have used to carry out the analysis: modelling (with a special focus on adaptation) and integration. Out of our solution, we also provide a brief note about authoring tools.
6.1 Modelling and Adaptation
With regard to general modelling, and to modelling focused on adaptive learning, we conclude that IMS-LD relies on a metaphor that is difficult to understand. This is not to say that people do not understand what a theatre is or how a play is performed. The key issue arises when a teacher needs to translate this well-known structure into specific pedagogical resources and features. This translation process turns out not to be so obvious. The conceptual model is clear: play, acts, roles, role-parts, and so on. But all of them, interlaced in a whole structure of learning, become complex. Even the simplest scenario requires some knowledge of the specification in a technical way. And this is far from being user-friendly, especially when the usual target audience consists of non-technical profiles.
The notation itself follows a usual XML Schema, and the definition of the several elements and components of the spec can become too complex, even for skilled programmers. The description of activities, activity structures, environments and so on, and the long cascade of relationships amongst them, makes a difficult-to-trace chain out of a simple scenario, not to mention when several roles are involved, when some components of Level B are used or when adaptive processes are required. The programming structure is quite easy, but the combination of elements, components and metaphor makes it difficult to implement.
The programming components provided by IMS-LD are quite simple (e.g., simple conditions, basic arithmetic, visualization of variables, visibility, DIV layers, and so on). On the other hand, their syntax is verbose, which hinders the rationale of the modelling process itself.
6.2 Communication, interoperability, integration of Units of Learning
We study three ways of communication: 1) a simple link between parts, 2) embedded information packages with no information exchange, and 3) full communication of information packages, sharing variables and states. This third solution turns out to be the most effective one. It implies the development of a communication layer that deals with an effective bi-directional exchange of data between information packages. Furthermore, this solution allows for the communication and sharing of services, along with variables, values and states, between IMS-LD and any outside counterpart, i.e., other specifications (e.g. SCORM), languages (e.g. PHP, Java and ActionScript), and LMSs (e.g. LAMS, www.lamsfoundation.org; Moodle, http://moodle.org; .LRN, www.dotlrn.org).
Should this exchange actually happen, it would encourage the re-use of information packages in different contexts and the development of templates, fostering the re-purposing of Units of Learning within and amongst the several communities of practice (target groups) involved in IMS-LD, beyond the purely technical niche.
Along the same lines, exportation and importation of Units of Learning have not been developed so far; neither has any connection with a database. Once more, no information exchange with other entities is possible so far.
The current two-step working process, which makes two isolated parts out of design-time and run-time, means that IMS-LD is compiled and not interpreted. This distinction prevents an on-the-fly visualisation and modification of the learning design, which would improve the interactive personalisation of the learning process. This issue concerns how IMS-LD is interpreted by tool and engine developers and not how the specification is actually designed.
6.3 Authoring
As aforementioned, this research and this paper are focused on the specification itself and do not deal with tools. However, authoring tools largely influence what can be modelled and how. Therefore, we point out a couple of key issues that could support the actual adoption of IMS-LD by the target groups:
a) There is a need for high-level visual authoring tools. Nowadays there are two types of tools: effective but too technical, even for technical profiles; and simple to understand but not powerful, since they usually deal with the very basic Level A. The creation of UoLs should be kept as far as possible from technical requirements and from the underlying elements, components or structure. A more visual approach would encourage the understanding and use of IMS-LD in a broader sense by the target groups.
Technical low-level editors should live alongside the visual high-level ones, though.
b) Any authoring tool should allow for integrated modelling, working with the manifest, the resources and the required external XHTML files through a common interface. It should handle dependencies and ease the setting of properties. This is a hot challenge, not possible so far.
7 Recommendations: Extensions and modifications
This section presents a rich and structured set of recommendations, modifications and extensions to improve the expressiveness of IMS Learning Design for adaptive learning processes. It rests on the aforementioned analysis. The following set of tables shows a summary of the constraints, analysis and recommendations (Table 2). The tables are structured as follows: in the grey-coloured first row of each table, Column 1 (ID) numbers the constraints and analysis issues. Prefix M relates to issues concerning Modelling, and prefix A relates to issues concerning Architecture. Column 2 (Constraints...) provides a description of those issues numbered in Column 1. The white-coloured row(s) afterwards present the recommendation(s) in the same couple format: ID and description.
ID: Constraints, analysis and recommendations
[M.01] Programming structures and resources are very basic (simple condition, simple arithmetic, properties set-up, visibility, DIV layers)
[Rec.01a] Condition of type case
[Rec.01b] Condition of type case with automatic ranges
[Rec.01c] Conditional loop, type while
[Rec.01d] Integer loop, type for-next
[Rec.01e] Modification of the element
[M.02] There is no management of absolute time. There is no synchronization nor input point to work with relative time from
[Rec.02] Modification of the reference to relative time. Addition of a reference to absolute time
[M.03] The notification service, in Level C, is under-used. It only sends an email or plays an activity
[Rec.03] Extension of the notification service, beyond using sendmail and playing an activity.
It can be called from other structures besides the part of a learning activity
[M.04] There is no clear way to handle the definition and use of properties and the links amongst the several XML files with global elements
[Rec.04] Syntax modification, definition and use of the elements view-property and set-property, as well as of the properties which make use of them
[M.05] The relationship between DIV layers and the visibility property is difficult to establish and follow
[Rec.05] In principle, the visibility property of any layer is turned off (hide), which simplifies the conditional structure that could make use of it
[M.06] There is no chance for iterations in any of the basic structures of the IMS-LD metaphor (learning activity, support activity, activity structure, act, play)
[Rec.06] Extension of the current syntax of every element with a parameter which defines an integer loop (type for-next) and/or a conditional loop (type while)
[M.07] There is no synchronization input point in the manifest
[Rec.07] Addition of an element GOTO which allows for a direct guiding of the learning flow
[M.08] There is no chance to assign a specific activity to a selected user
[Rec.08a] Addition of an element ASSIGN-ACTIVITY-TO-USER which allows for a direct match amongst users, groups and roles, with learning activities and activity structures
[Rec.08b] Addition of an element ASSIGN-USER-TO-ACTIVITY which allows for a direct match amongst users, groups and roles, with learning activities and activity structures
[Rec.08c] Addition of an element SWITCH-ACTIVITY which allows for turning activities and activity structures on and off
[M.09] There is no chance to make groups out of a selection inside the instance
[Rec.09] Addition of an element CREATE-GROUP which allows for grouping users of the same role
[M.10] The monitoring service does not allow for monitoring of groups
[Rec.10] Extension of the monitoring service to trace roles and groups
[A.11] IMS-LD does not allow for saving or retrieving data in external files, of any kind of format. In addition, connections with external databases or modules developed in other languages are not described or supported within the specification
[Rec.11a] Addition of the elements EXPORT and IMPORT to handle files with specific parameters (e.g., type TXT), defined in a new property type FILE-IO
[Rec.11b] Addition of the elements FROM-DB and TO-DB, which allow for saving and retrieving data in a database of type MySQL. The connection is defined in a new property type DATABASE
[A.12] There is no chance to modify the learning skeleton, method, roles definition or any other structural element at run-time
[Rec.12] Addition of two couples of global elements: a) view-IMS-LD and set-IMS-LD, b) view-resources and set-resources, which allow for the visualisation and modification of the learning design and the related resources at run-time
Table 2. Constraints, analysis and recommendations
At the project website pointed out in Section 3, every recommendation is expressed in an XML format, along with a full description and one example (Figure 2). [Figure 2. Example snippets of two recommendations]
8 Conclusions and further work
This paper shows the background on IMS Learning Design and how to model adaptive learning with this specification.
In addition, we provide a thorough analysis of a number of learning scenarios and a detailed list of issues to be modified and improved in the specification to better express adaptation. Based on these outcomes we provide recommendations, modifications and extensions to IMS Learning Design in order to improve its expres- siveness of adaptive learning. With these regards, Level A of IMS-LD provides the basic skeleton and a general framework to work with Units of Learning. It makes the 80% of the whole struc- ture. Level C, and above all Level B provide both the spec with stronger and more versatile resources. These two upper levels are the actual responsible means to model some of the current learning and teaching challenges (i.e., active learning, collaborative learning, adaptive learning, runtime tracking). Furthermore, we examine how to represent adaptive and adaptable Units of Learning with IMS Learning De- sign in order to model different types of Adaptation. Based on a literature study, a distinction is drawn between eight types of Adaptation that can be classified in two clusters: a) the main group, with interfaced-base, learning- flow and content-base; b) interactive problem solving support, adaptive information filtering, adaptive user grouping, adaptive evaluation, and changes on-the-fly. Out of this research and modelling efforts we derived a number of findings focused on the limitations that IMS- LD provides. These findings are mainly focused on adap- tive learning process. However, since this topic cannot be isolated from the overall approach of the specifications, some of the limitations, and further recommendations, also address other topics, like interoperability, or even authoring tools. Indeed, IMS-LD will benefit from a re-structure and modification of several elements focused on modelling and architecture. It will also improve the overall peda- gogical expressiveness, along with specific features on adaptation of learning processes and integration with other specifications, LMSs, and learning resources. These are two main objectives of the specification: personalised learning and interoperability. At the same time, IMS-LD would increase its level of implementation in real settings and a wider support from Communities of Practice of end users if one or several high-level visual authoring tools are developed. Nevertheless, this issue is out of the scope of this research, and it deals with research groups and com- panies working on the adoption of IMS-LD. Acknowledgments The research presented in this paper has been partially supported by the following projects: FLEXO (Spanish Plan Avanza, www.ines.org.es/flexo, TSI-020301-2009- 9), GRAPPLE (FP7-ICT-2007-1, www.grapple- project.org, contract number 215434). References [Ahmad et al., 2004] Ahmad, A., Basir, O., and Has- sanein, K. Adaptive user interfaces for intelligent e- Learning: issues and trends. Proceedings of The Fourth International Conference on Electronic Business, ICEB2004, Beijing, 2004 [Baeza-Yates & Ribeiro-Nieto, 1999] Baeza-Yates, R., and Ribeiro-Nieto, B. Modern Information Retrieval. Boston, MA, USA: Addison-Wesley, 1999 [Brusilovsky & Miller, 2001] Brusilovsky, P., and Miller, P. Course Delivery Systems for the Virtual University. In Tschang F.T. & T. Della Senta (Eds.), Access to Knowledge: New Information Technologies and the Emergence of the Virtual University (pp. 167-206). 
Amsterdam: Elsevier Science and International Asso- ciation of Universities, 2001 [Burgos & Griffiths, 2005] Burgos, D., and Griffiths, D. The UNFOLD Project. Understanding and using Learn- ing Design, 2005. Available at http://hdl.handle.net/1820/548 [Burgos & Specht, 2006] Burgos, D. and Specht, M. Adaptive e-learning methods and IMS Learning De- sign. An integrated approach, Proc. ICALT2006, 2006. Available at http://www.ask.iti.gr/icalt/2006 [Burgos et al., 2007] Burgos, D., Tattersall, C., and Ko- per, R. How to represent adaptation in eLearning with IMS Learning Design. Interactive Learning Environ- ments, 15(2), 161-170, 2007 287 [Burgos, 2008] Burgos, D. Extension of the IMS Learning Design Specification based on Adaptation and Integra- tion of Units of Learning (Extensión de la especifica- ción IMS Learning Design desde la Adaptación y la In- tegración de Unidades de Aprendizaje). Doctoral thesis. University Carlos III, Leganés, Madrid, Spain, 2008 [Butz et al., 2003] Butz, M. V., Olivier, S., and Gérard, P. Anticipatory Behavior in Adaptive Learning Systems : Foundations, Theories, and Systems. In. Berlin: Springer Verlag, 2003 [Chin, 2001] Chin, D. Empirical Evaluation of User Mod- els and User-Adapted Systems. User Modeling and User-Adapted Interaction, 11, 181-194, 2001 [De Bra et al., 2004] De Bra, P., Aroyo, L., and Cristea, A. Adaptive Web-based Educational Hypermedia. In M. Levene & A. Poulovassilis (Eds.), Web Dynamics, Adaptive to Change in Content, Size, Topology and Use (pp. 387-410): Springer, 2004 [IMSCP, 2001] IMS. IMS Content Packaging Specifica- tion. Version 1.1.3, 2001. Retrieved April 28th, 2009, from http://www.imsglobal.org/learningdesign/index.cfm [IMSLD, 2003] IMS. IMS Learning Design. Version 1. Retrieved February 27th, 2004, from http://www.imsglobal.org/learningdesign/index.cfm [Koper & Burgos, 2005] Koper, R., and Burgos, D. De- veloping advanced units of learning using IMS Learn- ing Design level B. International Journal on Advanced Technology for Learning, 2(4), 252-259, 2005 [Koper & Tattersall, 2005] Koper, R., and Tattersall, C. Learning Design: A Handbook on Modelling and De- livering Networked Education and Training. Berlin Heidelberg: Springer, 2005 [Merceron & Yacef, 2003] Merceron, A., and Yacef, K. A Web-based tutoring tool with mining facilities to im- prove learning and teaching. Proceedings of AI- Ed'2003: IOS Press, 2003 [Moreno et al., 2007] Moreno-Ger, P., Burgos, D., Sierra, J. L., & Fernández-Manjón, B. (2007). An eLearning specification meets a game: authoring and integration with IMS Learning Design and . Pro- ceedings of ISAGA. 2nd international workshop on Electronic Games and Personalized eLearning Proc- esses (EGAEL2007). July, 9th-13th, 2007, Nijmegen, The Netherlands [Romero et al., 2003] Romero, C., Ventura, S., De Bra, P. D., and Castro, C. D. (2003). Discovering prediction rules in AHA! courses. Proceedings of 9th International User Modeling Conference (pp. 25-34), 2003 [Specht & Burgos, 2006] Specht, M., and Burgos, D. Im- plementing Adaptive Educational Methods with IMS Learning Design. Proceedings of Adaptive Hypermedia 2006, Dublin, Ireland, 2006 [Van Rosmalen & Boticario, 2005] Van Rosmalen, P., and Boticario, J. Using Learning Design to support de- sign- and runtime adaptation. In R. Koper & C. Tatter- sall (Eds.), Learning Design: A Handbook on Modeling and Delivering Networked Education and Training. 
Heidelberg, Germany: Springer Verlag, 2005 288 Student Model Adjustment Through Random-Restart Hill Climbing Ahmad Salim Doost Saarland University 66123 Saarbruecken, Germany Erica Melis German Research Center for Articial Intelligence (DFKI) Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany Abstract ACTIVEMATH is a web-based intelligent tutor- ing system (ITS) for studying mathematics. Its course generator, which assembles content to personalized books, strongly depends on the un- derlying student model. Therefore, a student model is important to make an ITS adaptive. The more accurate it is, the better could be the adap- tation. Here we present which parameters can be optimized and how they can be optimized in an efficient and affordable manner. This methodol- ogy can be generalized beyond ACTIVEMATH’s student model. We also present our results for the optimization based on two sets of log data. Our optimization method is based on random-restart hill climbing and it considerably improved the student model’s accuracy. 1 Preliminaries An intelligent tutoring system (ITS) is a computer system that provides personalized support (either by giving feed- back or instruction) to students performing tasks without the intervention of human tutors. They can improve ed- ucation quality by providing new possibilities (e.g. for homework assignments or self-study opportunities). There- fore, ITSs can be used to support teaching and diminish the teacher shortage problem [Flynt and Morton, 2009; Ingersoll and Perda, 2009] to some extent. ITSs can be per- manently available, widely reusable, location-independent and adaptive. Adaptivity is an important feature of ITSs: it allows to focus the education to pedagogically useful con- tent like teaching content the student has difficulties with and keeping the student from working on too easy or too difficult exercises. To make ITSs able to adapt to single stu- dents, an underlying student model aiming to represent cer- tain aspects of a student (most importantly his/her skills), is required. A student model represents variables and characteristics of a learner which are relevant for educational purposes. A typical learner variable is the student’s state of knowl- edge concerning various concepts and competencies. This is continuously adjusted with the learner’s progress over time. When developing a student model several parameters have to be determined which, in a first run, may be defined in an ad hoc way. However, these values are usually not appropriate and also wrong interrelations may cause prob- lems. Instead, an evaluated configuration might improve the accuracy of the student model, which in turn leads to a better individualized learning experience. How can we optimize parameters in a generic way? First of all, the parameters of ACTIVEMATH’s student model, also called Salient Learner Model (SLM), which we want to optimize are described. We demonstrate how we adapted the Random-Restart Hill Climbing (RRHC) al- gorithm [Russel and Norvig, 2003] and used it to adjust SLM. Finally, we provide evaluation results and describe how the methodology can be generalized beyond SLM. 1.1 ACTIVEMATH and SLM ACTIVEMATH is a web-based ITS for studying mathemat- ics1 [Ullrich and Libbrecht, 2008]. It is being developed at Saarland University and at the German Research Cen- ter of Artificial Intelligence (DFKI) since the year 2000. 
Its learning content (like definitions, examples, exercises and explanations) is semantically encoded in an extended OMDoc [Kohlhase, 2006]. The OMDoc documents are then converted by the presentation component to the desired output format (like HTML or PDF, cf. Figure 1). [Figure 1: A page of a book in ACTIVEMATH from a web-browser.]
Students can not only browse through prerecorded static books. With the course generator of ACTIVEMATH, learning objects can be organized into a book which is specific to the student's abilities: the generated book depends on the student's goals and on the modeled belief about the learner's competencies. The goal can be, for example, to discover a new topic, revise an already known topic or simulate an exam. The content is selected according to special rules of the course generator and depends on the knowledge of the learner. The ACTIVEMATH course generator aims to generate user-specific courses with relevant learning material and appropriate difficulty. For example, if the student wants to discover a new topic but his competencies on some of the prerequisites for the topic are poor, then the course generator can include them in the personalized book with an easy level of difficulty. The content of the personalized dynamic books is updated and restructured with the student's progress over time.
The exercise system of ACTIVEMATH runs interactive problems with personalized feedback. Like other modules in ACTIVEMATH, the exercise system publishes events to notify the remaining parts of the system about the user interactions. The student model of ACTIVEMATH [Faulhaber and Melis, 2008] listens to these events and updates its belief after every exercise step. It takes the learner's achievements into account together with the exercise metadata (e.g. exercise difficulty).
The most relevant metadata is the set of trained concepts and cognitive processes. Addition of fractions or the Pythagorean theorem are examples of concepts. Concepts can be connected with each other with relations like the prerequisite relation, specifying that in order to be able to master one concept, the prerequisite concept has to be mastered first (e.g. the concept adding fractions with equal denominators forms a prerequisite of the concept adding fractions with unlike denominators). The cognitive process metadata defines what process (like detecting errors or applying algorithms) is trained. We refer to the pair of a concept and a cognitive process as a competency. For example, detecting errors about the Pythagorean theorem is a competency. Another very important piece of metadata that exercises are required to be annotated with is one of the five difficulty values: very easy, easy, medium, difficult or very difficult.
Out of a set of evidences, consisting of a concept, a cognitive process, a difficulty value and the achievement (whether the exercise was solved correctly), the student model estimates students' proficiencies. These evidences are generated after each approach to a problem. For each pair of a concept and a cognitive process (i.e. for each competency) defined in the metadata of an exercise, evidences are generated, propagated to related concepts (thus becoming indirect evidences) and stored in their specific containers. The modeled belief about the student's competency is then updated based on the available evidences, where indirect evidences have less influence on the resulting mastery value. Figure 2 illustrates this process. [Figure 2: The interaction of a student with ActiveMath creates events, which are handled by the student model: evidences are generated for each pair (k, p) of concept k and cognitive process p, propagated along prerequisite relations of concepts, stored in the specific container for the competency (k, p), and the masteries for each competency are updated based on the available evidences.]
1 http://demo.activemath.org
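To make the evidence flow just described concrete, here is a small, purely illustrative Python sketch; it is not the actual SLM code, and the prerequisite graph, the container size and the propagation depth are placeholder assumptions (the role of these parameters is analysed in Section 2).

    # Conceptual sketch of the evidence flow (not the actual SLM implementation).
    # PREREQUISITES, CONTAINER_SIZE and PROPAGATION_DEPTH are assumed values.
    from collections import defaultdict

    PREREQUISITES = {"add_unlike_fractions": ["add_like_fractions"]}  # hypothetical graph
    CONTAINER_SIZE = 6        # maximum evidences kept per competency ("forgetting")
    PROPAGATION_DEPTH = 1     # how far indirect evidences travel along prerequisites
    containers = defaultdict(list)   # (concept, process) -> stored evidences

    def store(concept, process, difficulty, achievement, direct):
        c = containers[(concept, process)]
        c.append({"difficulty": difficulty, "ach": achievement, "direct": direct})
        if len(c) > CONTAINER_SIZE:   # discard the oldest evidence in favour of new ones
            c.pop(0)

    def handle_exercise_step(concepts, processes, difficulty, achievement):
        for k in concepts:
            for p in processes:
                store(k, p, difficulty, achievement, direct=True)
                frontier, depth = {k}, 0
                while frontier and depth < PROPAGATION_DEPTH:
                    # indirect evidences for concepts reached via prerequisite relations
                    frontier = {pre for c in frontier for pre in PREREQUISITES.get(c, [])}
                    for k2 in frontier:
                        store(k2, p, difficulty, achievement, direct=False)
                    depth += 1
                # the mastery of each touched competency (k, p) would now be
                # re-estimated from the evidences in containers[(k, p)]

    handle_exercise_step(["add_unlike_fractions"], ["apply_algorithm"], "medium", 1)
    print(dict(containers))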
1.2 Item Response Theory in SLM
The calculation of the probabilities for different masteries is based on the Item Response Theory (IRT) [Lord, 1980], which describes the relationship between the probability for a correct answer X to a mastery value (ability) θ:

P_i(X = correct | θ) = c_i + (1 − c_i) / (1 + e^(−a_i·(θ − b_i)))    (1)

The resulting function is called the item characteristic curve (ICC). Equation 1 shows the three-parameter model of the item response theory. The parameters are exercise (item) specific, where
• a_i is the item discrimination factor,
• b_i is the item difficulty,
• c_i ∈ [0, 1] is the item guessing probability.
The item difficulty b_i defines the location of the mastery value θ where the probability of correct response is exactly 0.5, i.e. P_i(X = correct | θ = b_i) = 0.5. Hence, the larger b_i is, the larger the mastery θ has to be to have a probability of at least 50%. The discrimination factor defines the steepness of the S-shaped curve. Therefore, it describes how well it is differentiated between learners having masteries below the item difficulty (θ < b_i) and those having masteries above the item difficulty (θ > b_i). Figure 3 shows different ICCs with differing item difficulties. [Figure 3: Typical ICCs showing probabilities of a correct answer for different masteries θ and varying item difficulties b_i. The guessing probability c_i is fixed to 0 and a_i to 10.]
The actual domain of the mastery θ, the item discrimination factor a_i, and the item difficulty (or item location) b_i is R. However, if one restricts the domain of the item difficulty b_i, then the most differing probabilities for different mastery values θ will be around that range as well. This is because of the nice property that P_i(X = correct | θ = b_i) = 50%. SLM makes use of this and restricts the domain of θ and b_i to [0.0, 1.0], which simplifies the interpretation of difficulty values. Furthermore, SLM restricts its calculation to those masteries θ which are "near" the item difficulty (i.e. θ ∈ [b_i − δ, b_i + δ], where δ is the information radius). This will keep the student model from assuming high mastery values for a student who solved easy exercises only. For more details on how the student model of ACTIVEMATH works, please refer to [Faulhaber, 2007; Faulhaber and Melis, 2008; Doost, 2010].
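As a quick sanity check of Equation (1), the following sketch evaluates the three-parameter ICC for the parameter settings described for Figure 3 (c_i = 0, a_i = 10, and b_i in {0.0, 0.5, 1.0}); the concrete values are illustrative only.

    # Three-parameter ICC from Equation (1); parameter values mirror Figure 3.
    import math

    def icc(theta, a_i=10.0, b_i=0.5, c_i=0.0):
        """P_i(X = correct | theta) for the three-parameter IRT model."""
        return c_i + (1.0 - c_i) / (1.0 + math.exp(-a_i * (theta - b_i)))

    for b in (0.0, 0.5, 1.0):
        probs = [round(icc(t, b_i=b), 2) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
        print(f"b_i = {b}: {probs}")
    # Note that icc(b, b_i=b) == 0.5 for any b, i.e. P(correct | theta = b_i) = 0.5.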
2 Which SLM Parameters can be Adjusted?
Depending on the structure, design and implementation of a student model, different ways of adjustment and optimization possibilities emerge. This section presents most of the student model parameters that have been selected for the optimization process. Some of the parameters were actually hard-coded into the system. In order to discover them, a thorough analysis of the implementation was necessary.
Propagation Depth defines the maximum distance an evidence is propagated along the graph of prerequisite relations between concepts. Too deeply propagated evidences might cause the SLM to transfer the student's knowledge about one concept to other, possibly only loosely related concepts. However, with a completely deactivated propagation we might lose possible accuracy gains for concepts for which direct evidences are not available yet.
Weighing Evidences. How strongly shall we weigh direct evidences compared to indirect ones when both are available? Giving no weight to indirect evidences is equal to having no propagation. However, giving them too much weight might understate the significance of direct evidences.
Evidence Container Size is the maximum number of stored direct and indirect evidences per competency and represents the forgetting factor. When this number is reached, old evidences are discarded in favor of new ones.
Difficulty Mapping. What is the range [α, β] into which the five difficulty values are mapped (every exercise in ACTIVEMATH is annotated with one out of five possible values)? Probabilities received from the IRT strongly depend on the difficulty of an exercise. An arbitrary mapping might lead to misinterpretations of difficulty metadata and negatively affect the mastery estimations.
Default Discrimination Factor determines the curvature of the sigmoid function, which converges to a step function for high values, making marginal masteries more probable and mid-level masteries less probable.
3 Random-Restart Hill Climbing
Hill climbing [Russel and Norvig, 2003] is a greedy local search algorithm and can be used for optimization problems. Hill climbing algorithms can find reasonable solutions in large or infinite (continuous) state spaces for which systematic algorithms fail. Generally, when looking for a maximum of a function (optimization problem), the hill climbing algorithm works as follows:
1. Start at an arbitrary point
2. Calculate values for neighboring points
3. Move to the point with increased value
4. Terminate if no higher value could be found, otherwise continue at 1
The standard problem with this algorithm is that it may not find the optimal solution (i.e. the global maximum), but only a local maximum. However, with an extension known as random-restart one can increase the probability to find a global maximum considerably [Russel and Norvig, 2003]. Starting the hill climbing algorithm over and over again, each time with randomly chosen initial states, and saving only the maximum of the new values improves the probability of finding the global maximum.
The complex structure of the student model and the possibly interrelated parameters make it difficult - if not practically impossible - to determine the optimal values for the parameters by classic machine learning methods. For optimizing the parameters mentioned in Sect. 2, the Random-Restart Hill Climbing (RRHC) algorithm has been modified. Our version steadily switches between the several different parameters. It modifies each parameter as long as improvements could be found, then switches to the next and returns to that parameter again as soon as all other parameters were taken into account. We restart with a new initial random state if no parameter change yielded improvements and terminate after a predefined maximal number of iterations.
The continuous state space of parameters with real values requires additional attention. In order to find the maximum quickly, we need to make noticeably larger steps than the smallest possible difference. This is done by using an initial step size δ depending on the range size of possible values for a parameter. The larger the interval, the larger will be δ. The parameter is increased by δ as long as there is any improvement. Then, the step size is refined by a factor 0 < α < 1 until the step size gets smaller than a predefined threshold ε (we used 0.1 for α and 0.0001 for ε). We continue with negative step sizes starting from −δ·α and increase it by multiplying with α again until we reach a value larger than −ε. In short, we first start with a rough search by modifying a parameter with larger steps, then we conclude with fine tuning by modifying a parameter with smaller and smaller steps until we reach a certain threshold for a minimum step size. A flowchart of our modified version of RRHC is presented in Figure 4. [Figure 4: A flowchart of the modified RRHC algorithm used for parameter optimization.]
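A compact sketch of this coordinate-wise search is given below; it is not the original implementation. The parameter ranges are placeholders, evaluate() stands for the log-data replay described in Section 4, and details such as range clamping and integer-valued parameters are omitted.

    # Hedged sketch of the modified RRHC: coordinate-wise hill climbing with
    # step refinement and random restarts. evaluate(cfg) is assumed to return
    # the achievement prediction accuracy obtained by replaying log data.
    import random

    PARAM_RANGES = {                      # illustrative ranges, not the exact ones
        "propagation_depth": (0.0, 5.0),
        "indirect_evidence_weight": (0.0, 1.0),
        "container_size": (1.0, 20.0),
        "difficulty_low": (0.0, 0.5),
        "discrimination": (1.0, 20.0),
    }
    ALPHA, EPS = 0.1, 1e-4                # refinement factor and minimum step size

    def rrhc(evaluate, restarts=10):
        best_cfg, best_acc = None, float("-inf")
        for _ in range(restarts):                         # random restart
            cfg = {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}
            acc = evaluate(cfg)
            improved = True
            while improved:                               # sweep over all parameters
                improved = False
                for name, (lo, hi) in PARAM_RANGES.items():
                    for sign in (+1, -1):                 # climb upwards, then downwards
                        step = sign * ALPHA * (hi - lo)   # initial step from the range size
                        while abs(step) >= EPS:
                            cfg[name] += step
                            new_acc = evaluate(cfg)
                            if new_acc > acc:             # keep the move, keep the step
                                acc, improved = new_acc, True
                            else:                         # undo the move, refine the step
                                cfg[name] -= step
                                step *= ALPHA
            if acc > best_acc:
                best_cfg, best_acc = dict(cfg), acc
        return best_cfg, best_acc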
4 Evaluation and Results
An important question remains: what is the best function for the optimization problem in order to find a good parameter configuration for a student model?
The student model can be evaluated by replaying it with empirical data collected during another experiment. Based on this data an experiment can be simulated for each possible configuration without organizing new ones. Its log data can be rerun by going through all event entries chronologically and passing them over to the student model. During the simulation process we can query the student model about its belief and compare it to the actual achievements before passing on this information to the student model. Figure 5 depicts a diagram of this evaluation process. [Figure 5: The evaluation process of the student model SLM through replay of log data: start with empty models, send the metadata of the current exercise to SLM, let SLM predict the achievement, compare the predicted and actual achievement for the statistics, inform SLM about the actual achievement, let SLM update its student model, and continue with the next exercise.]
We believe that the more accurate the student model is, the better are the achievement predictions, i.e. whether a student is able to solve a specific exercise or not. Our function for the optimization problem is therefore the achievement prediction accuracy, which we strive to maximize. It is calculated as shown in Figure 6, where n is the overall number of exercise steps available (i.e. the total number of predictions), ach_i is the actual achievement in the i-th exercise step (it is 1 if the student's answer was correct, otherwise 0), and pred_i is the student model's prediction for the achievement of the i-th exercise step:

accuracy = (number of correct predictions) / (overall number of predictions) = (1/n) · Σ_{i=1..n} (1 − |ach_i − pred_i|)    (Figure 6: Achievement Prediction Accuracy)

For the evaluation, log data collected in the context of the ATuF project [Eichelmann et al., 2008] (with 190 students, 6th and 7th graders) and data from experiments of the ALoE project [Tsovaltzi et al., 2009] (funded by the German National Science Foundation (DFG), ME1136/7; with 77 students, mostly 6th graders) has been used.
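The two measures used in the next subsection can be written down in a few lines. This is a hedged sketch: the paper does not spell out the exact significance-test formulation, so the simple two-proportion z-test below only approximates the reported value.

    # Achievement prediction accuracy (Figure 6) and a simple one-sided
    # two-proportion z-test; the test formulation is an assumption.
    import math

    def prediction_accuracy(ach, pred):
        """(1/n) * sum_i (1 - |ach_i - pred_i|), with ach_i, pred_i in {0, 1}."""
        return sum(1 - abs(a - p) for a, p in zip(ach, pred)) / len(ach)

    def two_proportion_z(p_old, p_new, n):
        """z-score for H0 'no improvement', assuming equal sample sizes n."""
        pooled = (p_old + p_new) / 2
        se = math.sqrt(pooled * (1 - pooled) * 2 / n)
        return (p_old - p_new) / se

    print(two_proportion_z(0.79, 0.82, 15680))   # about -6.7 (the paper reports -6.3)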
4.1 Results
For the ATuF data, the original SLM had an achievement prediction accuracy of 79%. This is already a highly improved and optimized value, which was the result of several years of experimentation. Still, with the parameter configuration obtained from the presented method, the accuracy became 82%. In absolute numbers, this means that 470 out of 15680 predictions which were wrongly predicted before are now predicted correctly. For content selection based on the estimated competencies, every wrongly selected item (like selecting too easy or too difficult exercises, or leaving missing prerequisites away) could make considerable differences.
A statistical hypothesis test (one-sided) with a statistical significance of 5% and the null hypothesis being the non-improvement of the prediction accuracy yields a z-score of −6.3. This means that we have to reject the null hypothesis (since the z-score is less than −1.65) and assume that the alternative hypothesis is true, namely that the accuracy improvement by 3% is (strongly) significant. The evaluation for the second data set, the ALoE data, revealed for the original SLM an overall prediction accuracy of 69% (the prediction accuracy on the ALoE data is on average about 10% lower compared to the ATuF data because the metadata quality of the ALoE data is worse and the number of difficult exercises is higher). With the optimized parameter configuration the overall prediction accuracy rose by 2% to 71%. This yields a z-score of 2.1, which is statistically significant as well.
In the following, the single parameter values of the configuration which yielded the highest accuracy will be mentioned briefly. Surprisingly, a propagation depth of one produced almost the same accuracy as no propagation. However, a depth of two or more worsened the result. This would mean that we cannot really transfer the students' knowledge level of one concept to related concepts except to directly adjacent ones (without any gain, though). For weighing direct and indirect evidences a ratio of 12:1 was obtained. As expected, the value of direct evidences is considerably higher. Furthermore, keeping a maximum of six evidences per competency produced the highest accuracy. So did the interval [0.14, 0.86] for the difficulty mapping, which means that the easiest exercises get a difficulty value of 0.14 instead of 0. Finally, a discrimination factor of 9.6 proved to be best.
5 Related Work
Finding optimal parameter configurations is a problem many student models have to tackle. The parameters may belong to Bayesian networks, to metadata of exercises or concepts (also known as rules), to the working mechanism of a student model itself or to any other part related to the student model. Corbett and Anderson conducted several experiments to evaluate their knowledge tracing student model. From the resulting data they fit the weights
In the experiment with the optimized model students required on average 12% less time to reach the same level. Gong, Beck and Heffernan compared two techniques for student modeling, Knowledge Tracing and Performance Factor Analysis (PFA), in terms of their predictive accu- racy and parameter plausibility [Gong et al., 2010]. For fitting the knowledge tracing model, they used and com- pared two different techniques: expectation maximization and brute force. To minimize the problem of local max- ima, they restart their fitting algorithm multiple times as well. In their work, Gong et al reported that the brute force method was not as good as the expectation maximization method. Brute force is a very expensive and – because of the continuous state space – an infeasible model fitting method. Random-restart hill climbing (RRHC) is a method in between expectation maximization and brute force. In a brute force manner it evaluates and tries out different pa- rameter configurations. However, it does not evaluate an equally distributed sampled set of configurations, but in- stead it modifies parameters stepwise in the direction of expected improvement until a maximum is reached. Local search methods, like hill climbing, are often used for finding solutions efficiently, especially for NP- hard problems like the traveling salesman problem [Aarts and Lenstra, 2003]. Random-restart or multiple-restart is a known solution for increasing the probability to ob- tain a global maximum instead of a local one. The us- age of dynamically varying step sizes is also not new: it has been already reported in [Yuret and de la Maza, 1993; Miller, 2000]. Basically, any local search algorithm could have been used. RRHC, however, is because of its sim- plicity and straight forwardness the first choice for being adopted for optimizing several student model parameters. Other possible techniques are, for example, simulated an- nealing or genetic algorithms. This work used the achievement prediction accuracy to measure the performance of student models. This is a common approach. Corbett and Anderson, for exam- ple, used the prediction accuracy to evaluate their knowl- edge tracing student model used in the ACT Programming Tutor [Corbett and Anderson, 1995]. Their performance prediction after every step depends on (1) the probabil- ity that a rule is in the learned state, (2) a slip parameter and (3) a guess parameter for the according rule. Des- marais and Gagnon compared the prediction accuracy of two cognitive student models: one based on a standard Bayesian network approach and the other on a more con- strained one [Desmarais and Gagnon, 2006]. Previous versions of student models of ACTIVEMATH were evalu- ated by the prediction accuracy as well [Faulhaber, 2007; Faulhaber and Melis, 2008]. 6 Conclusions and Future Work This paper presented how the parameters of ACTIVE- MATH’s student model had been optimized with the lo- cal search algorithm known as random-restart hill climb- ing. Several parameters were presented, which were then optimized by a modified version of the RRHC algorithm. 293 The modified RRHC uses dynamic step sizes and steadily switches between different parameters. The evaluation function replays log data to measure the achievement pre- diction accuracy of a configuration. The achievement prediction accuracy improved significantly by 3% for the ATuF data and by 2% for the ALoE data. Until this work, the parameters of ACTIVEMATH’s stu- dent model remained to be a manually chosen configura- tion. 
The evaluated parameter configuration did indeed im- prove SLM’s accuracy. Since the performance of the orig- inal SLM was already quite good, an increase by 2% and 3% is an acceptable improvement. Nevertheless, improve- ments by 5% and above would have been more convincing. The presented methodology can be generalized to opti- mize other user models as well. First of all, the set of pa- rameters which have to be optimized has to be fixed. Ad- ditionally, an evaluation method has to be defined — re- playing log data and measuring the prediction accuracy is just one possibility. Finally, RRHC can be used to find an optimized parameter configuration. In the future we plan to use log data obtained from other projects using different exercises with participants of the same age as well as more experienced participants. This might help us to find out whether the obtained configura- tion is specific to the young learners or to the used exercises domain (which was fraction mathematics in both experi- ments, ATuF and ALoE). We want to further analyze the necessity of propagation and possible differences between age groups. Acknowledgements This article results from the ATuF project (ME 1136/5- 2) funded by the German National Science Foundation (DFG). References [Aarts and Lenstra, 2003] Emile L. Aarts and Jan K. Lenstra. Local search in combinatorial optimization. Princeton Univ Pr, 2003. [Cen et al., 2007] Hao Cen, Kenneth R Koedinger, and Brian Junker. Is over practice necessary? –improving learning efficiency with the cognitive tutor through edu- cational data mining. In Proceeding of the 2007 confer- ence on Artificial Intelligence in Education, pages 511– 518, Amsterdam, The Netherlands, 2007. IOS Press. [Corbett and Anderson, 1995] Albert T Corbett and John R Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4:253–278, 1995. [Desmarais and Gagnon, 2006] Michel C Desmarais and Michel Gagnon. Bayesian Student Models Based on Item to Item Knowledge Structures, pages 111–124. Springer-Verlag Berlin Heidelberg, 2006. [Doost, 2010] Ahmad Salim Doost. Enhancement of ac- tivemath’s student model. Master’s thesis, Saarland Uni- versity, 2010. [Eichelmann et al., 2008] Anja Eichelmann, Susanne Nar- ciss, Arndt Faulhaber, and Erica Melis. Analyzing computer-based fraction tasks on the basis of a two- dimensional view of mathematics competences. In EARLI SIG 6&7, pages 125–134. Springer, 2008. [Faulhaber and Melis, 2008] Arndt Faulhaber and Erica Melis. An efficient student model based on student performance and metadata. In Nikos Avouris, Nikos Fakotatis, Constantine D. Spyropoulos, and Malik Ghal- lab, editors, Proceedings of 18th European Conference on Artificial Intelligence. 18th European Conference on Artificial Intelligence (ECAI-08), July 21-25, Patras, Greece, volume 178 of Frontiers in Artificial Intelli- gence and Applications, FAIA, pages 276–280. Univer- sity of Patras, IOS Press, 2008. [Faulhaber, 2007] Arndt Faulhaber. Building a new learner model for activemath combining transferable belief model and item response theory. Master’s thesis, Saar- land University, 2007. [Flynt and Morton, 2009] Samuel W Flynt and Rhonda Collins Morton. The teacher shortage in america: Pressing concerns. National Forum of Teacher Education Journal, 19(3), 2009. [Gong et al., 2010] Yue Gong, Joseph E Beck, and Neil T Heffernan. Comparing Knowledge Tracing and Perfor- mance Factor Analysis by Using Multiple Model Fitting. 2010. 
[Ingersoll and Perda, 2009] Richard M Ingersoll and David Perda. The mathematics and science teacher shortage: Fact and myth. Technical Report CPRE Research Report #RR-62, The Consortium for Policy Research in Education, March 2009. [Kohlhase, 2006] Michael Kohlhase. OMDoc – An Open Markup Format for Mathematical Documents [version 1.2]: Foreword by Alan Bundy. Lecture Notes in Com- puter Science. Springer-Verlag New York, Inc., Secau- cus, NJ, USA, 2006. [Lord, 1980] Frederic M. Lord. Applications of item re- sponse theory to practical testing problems. Hillsdale NJ: Erlbaum, 1980. [Miller, 2000] Ronald E Miller. Optimization: Founda- tions and Applications. John Wiley & Sons, 2000. [Russel and Norvig, 2003] Stuart Russel and Peter Norvig. Artificial Intelligence: A Modern Approach, pages 110–114. Prentice Hall, 2 edition, 2003. [Tsovaltzi et al., 2009] Dimitra Tsovaltzi, Erica Melis, Bruce McLaren, Michael Dietrich, George Goguadze, and Ann-Kristin Meyer. Erroneous examples: A pre- liminary investigation into learning benefits. In Ulrike Cress Vania Dimitrova Marcus Specht, editor, Proc. of the First European Conference on Technology Enhanced Learning (EC-TEL 2009), volume LNCS 5794 of Lec- ture Notes in Computer Science, Heidelberg, 2009. Springer-Verlag. [Ullrich and Libbrecht, 2008] Carsten Ullrich and Paul Libbrecht. Educational Services in the ActiveMath Learning Environment, pages 211–236. The Future of Learning. IOS Press Amsterdam, March 2008. [Yuret and de la Maza, 1993] Deniz Yuret and Michael de la Maza. Dynamic hill climbing: Overcoming the limitations of optimization techniques. In In The Second Turkish Symposium on Artificial Intelligence and Neural Networks, pages 208–212, 1993. 294 On the Role of Social Tags in Filtering Interesting Resources from Folksonomies Daniela Godoy ISISTAN Research Institute, UNICEN University Campus Universitario, CP 7000, Tandil, Argentina Also at CONICET, Argentina dgodoy@conicet.gov.ar Abstract Social tagging systems allow users to easily cre- ate, organize and share collections of resources (e.g. Web pages, research papers, photos, etc.) in a collaborative fashion. The rise in popu- larity of these systems in recent years go along with an rapid increase in the amount of data con- tained in their underlying folksonomies, thereby hindering the user task of discovering interest- ing resources. In this paper the problem of fil- tering resources from social tagging systems ac- cording to individual user interests using purely tagging data is studied. One-class classification is evaluated as a means to learn how to iden- tify relevant information based on positive ex- amples exclusively, since it is assumed that users expressed their interest in resources by annotat- ing them while there is not an straightforward method to collect non-interesting information. The results of using social tags for personal clas- sification are compared with those achieved with traditional information sources about the user in- terests such as the textual content of Web doc- uments. Finding interesting resources based on social tags is an important benefit of exploiting the collective knowledge generated by tagging activities. Experimental evaluation showed that tag-based classification outperformed classifiers learned using the full-text of documents as well as other content-related sources. 
1 Introduction Social tagging systems have grown in popularity on the Web in the last years on account of their simplicity to cat- egorize and retrieve shared content using open-ended tags. In sites such as Del.icio.us1, Flickr2 or CiteULike3, users annotate a variety of resources (Web pages, blog posts or pictures) using a freely chosen set of keywords, which fa- cilitates later search and retrieval of such contents. Folksonomies [Mathes, 2004] are the primary structure of the novel social classification scheme introduced by tag- ging systems, which relies on the convergence of the tag- ging efforts of a large community of users to a common categorization system that can be effectively used to orga- nize and navigate a massive amount of freely accessible, user contributed and annotated Web resources. 1http://del.icio.us/ 2http://www.flickr.com/ 3http://www.citeulike.org/ In spite of the novel mechanisms for searching and re- trieving resources provided by collaborative tagging sys- tems, the rapid increase in size of communities using these systems as well as the large amount of shared content avail- able make the discovery of relevant resources a time con- suming and difficult task for users. This problem is aggra- vated by the completely unsupervised nature of social tags, resulting in ambiguity, noise, etc.; which may reduce their effectivity in content indexing and searching. The goal of this paper is to study the utility of social tags as a source of information for filtering resources from folksonomies according to the user interests. In social tag- ging systems resources receive tag assignments by mem- bers of the community, describing their content in a col- lective sense. Thus, it can be assumed that users are likely to be interested in additional content annotated with simi- lar tags to the ones collectively assigned to resources they showed interest in before. Social tags associated to the resources annotated by the user can be used to build a user interest profile that, in turn, can be applied to filter further incoming information from tagging systems (e.g. RSS feeds). From a user perspective, social tags can be thought of as indicators of user aware- ness and potential interest in a given resource [Arakji et al., 2009], allowing users to capitalize on the associations made by persons who have assigned similar tags to other resources. In order to identify interesting resources, tag-based clas- sifiers are learned using the resources users annotate and have in their personomies, the tag collection of a single user, as positive examples of their interests. This is a spe- cial case of classification in which it is necessary to de- termine whether an example (resource) belongs to a target class (interesting) when only examples of the target class are given, which is known as one-class classification. The rest of the paper is organized as follows. Section 2 gives an overview of one-class classification using Support Vector Machines (SVM) classifiers. Section 3 describes the dataset used for experimentation, gathered fromDel.icio.us bookmarking site. The empirical analysis carried out to compare content-based and tag-based classification of Web pages in personomies is presented in Sections 4 and 5, re- spectively. Section 6 reviews related research. Finally, em- pirical findings are summarized in Section 7. 2 One-class Classification User actions of assigning tags to resources are a strong indication of relevance about its content. 
Consequently, positive examples of the user interests can be easily collected from folksonomies. On the contrary, it would be hard to identify representative negative examples, i.e. non-interesting resources, since a user might not tag a potentially interesting resource for multiple reasons, such as not knowing about its existence or lacking the time to tag or even read it.

The task of determining whether a document is interesting for a user when training only on positive examples can be seen as a one-class classification problem. One-class classification differs in one essential aspect from conventional classification, as it assumes that only information about one of the classes, the target class, is available. The idea is to define a boundary around the target class, estimated from data belonging to the relevant class only, such that it accepts as many of the target objects as possible while minimizing the chance of accepting outlier objects.

Support Vector Machines (SVMs) are a useful technique for data classification; they are arguably the most accurate algorithm for text classification and are also widely used in Web page classification. Schölkopf et al. [Schölkopf et al., 2001] extended the SVM methodology to handle training using only positive information, and Manevitz and Yousef [Manevitz and Yousef, 2002] applied this method to document classification and compared it with other one-class methods.

Essentially, the one-class SVM algorithm consists in learning the minimum volume contour that encloses most of the data; it was proposed for estimating the support of a high-dimensional distribution [Schölkopf et al., 2001]. Given a set of training vectors $X = \{x_1, \ldots, x_l\}$ in $\mathbb{R}^n$, the aim is to train a function $f_X : \mathbb{R}^n \rightarrow \mathbb{R}$ such that most of the data in $X$ belong to the set $R_X = \{x \in \mathbb{R}^n \mid f_X(x) \geq 0\}$ while the volume of $R_X$ is minimal. This problem is termed minimum volume set (MVS) estimation, and the membership of $x$ in $R_X$ indicates whether this data point is overall similar to $X$.

One-class SVM solves MVS estimation by first mapping the data into a feature space $H$ using an appropriate kernel function $\phi : \mathbb{R}^n \rightarrow H$ which transforms the training examples into another space. Here, the Gaussian RBF kernel is used, formulated as $\exp(-\gamma \|x_i - x_j\|^2)$, where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^n$. For training, a certain number of data points of the positive class are treated as if they belonged to the negative class. The SVM approach proceeds in $H$ by determining the hyperplane $W$ that separates most of the data from the origin, setting apart a certain percentage of outliers from the rest of the data points. In order to separate the data points from the origin, the following quadratic programming problem needs to be solved:

$$\min_{w, \xi, \rho} \; \frac{1}{2} w^T w - \rho + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i$$

subject to $w^T \phi(x_i) \geq \rho - \xi_i$ and $\xi_i \geq 0$, $i = 1, 2, \ldots, l$, where the $\xi_i$ are so-called slack variables and $\nu$ (Nu) tunes the fraction of data points that are allowed to lie on the wrong side of $W$; this parameter defines the trade-off between the percentage of data points treated as belonging to the positive and negative classes. A solution is then given by the $\alpha_i$ that satisfy the dual optimization problem:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha \quad (1)$$

subject to $0 \leq \alpha_i \leq 1/(\nu l)$, $i = 1, \ldots, l$, and $e^T \alpha = 1$, where $Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$. In this work we used the LibSVM library [Chang and Lin, 2001] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), which solves a scaled version of (1) as follows:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha \quad (2)$$

subject to $0 \leq \alpha_i \leq 1$, $i = 1, \ldots, l$, and $e^T \alpha = \nu l$. Finally, the decision function is:

$$\operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i K(x_i, x) - \rho \right)$$

In order to adjust the kernel for optimal results, the parameter $\gamma$ needs to be tuned to control the smoothness of the boundary, i.e. large values of $\gamma$ lead to flat decision boundaries. This parameter is initially set to $\gamma = 0$; variations of this value are discussed in Section 5.
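The following minimal sketch shows how such a ν-parameterized one-class classifier can be trained and applied in code. It assumes scikit-learn's OneClassSVM (which is built on the LIBSVM implementation cited above) rather than the exact experimental setup of this paper, and the toy feature vectors are purely illustrative.

    # Minimal sketch: one-class SVM trained on positive examples only,
    # assuming scikit-learn's OneClassSVM (a wrapper around LIBSVM).
    # The toy vectors stand in for the term or tag vectors of a personomy.
    import numpy as np
    from sklearn.svm import OneClassSVM

    # rows = resources annotated by the user (positive examples only),
    # columns = features (e.g. terms or tags)
    X_train = np.array([[1.0, 0.0, 2.0],
                        [0.0, 1.0, 1.0],
                        [1.0, 1.0, 0.0]])

    # nu is the fraction of training points allowed on the wrong side of
    # the boundary; gamma controls the smoothness of the RBF kernel.
    clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5)
    clf.fit(X_train)

    # +1 = accepted as "interesting" (inside the boundary), -1 = outlier
    X_new = np.array([[1.0, 0.5, 1.5]])
    print(clf.predict(X_new))
    print(clf.decision_function(X_new))  # signed distance to the boundary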
3 Dataset Description

Emerging social structures in tagging systems, also known as folksonomies, can be defined as a tuple $F := (U, T, R, Y, \prec)$ which describes the users $U$, resources $R$ and tags $T$, together with the user-based assignment of tags to resources via a ternary relation between them, i.e. $Y \subseteq U \times T \times R$ [Hotho et al., 2006]. The collection of all tag assignments of a single user constitutes a personomy, i.e. the personomy $P_u$ of a given user $u \in U$ is the restriction of $F$ to $u$: $P_u := (T_u, R_u, I_u, \prec_u)$ with $I_u := \{(t, r) \in T \times R \mid (u, t, r) \in Y\}$, $T_u := \pi_1(I_u)$, $R_u := \pi_2(I_u)$, and $\prec_u := \{(t_1, t_2) \in T \times T \mid (u, t_1, t_2) \in \prec\}$, where $\pi_i$ denotes the projection onto the $i$-th dimension.

Empirical evaluation was carried out using data collected from the Del.icio.us (http://del.icio.us/) social bookmarking system. From this site, 50 complete personomies were gathered from different users appearing on the main page. Each personomy includes all of the user's bookmarks and the corresponding tag assignments. In this collection of personomies there are users with as few as 10 and as many as 2,521 bookmarks. For each Web page, in turn, all tags assigned by other members of the community were also extracted from Del.icio.us, obtaining the full tagging activity (FTA), i.e. all annotations related to each resource.

From the total set of resources gathered from the Del.icio.us site, the experiments reported in this paper were performed over English-written pages, identified using the classification approach presented in [Cavnar and Trenkle, 1994]. This allows applying language-dependent pre-processing tasks to both texts and social tags. The resulting folksonomy counts |U| = 50 users, |T| = 233,997 tags and |R| = 49,265 bookmarks or Web pages, related by a total of |Y| = 128,642,112 tag assignments. Table 1 summarizes the main statistics of this collection of Web pages, averaged by personomy. It includes the number of unique terms in the full text of the resources belonging to the different personomies as well as in the text of anchors and titles. It also contains the number of tags assigned by members of the community to the resources of each user, considering both the overall top 10 tags and the full tagging activity. The effect produced in these numbers by the two tag filters explained in Section 5 is also detailed. The average number of each element corresponds to the number of features the classifiers have to deal with during learning.

In all experiments reported in this paper, evaluation was carried out using a holdout strategy that split the data into 66% for training and 34% for testing. In order to make the results less dependent on the data splitting, the average and standard deviation over 10 runs for each user is reported in all experiments. That is, each personomy was divided into a training set used to learn the classifier and a testing set used to assess its validity. Since this testing set only contains interesting examples, uninteresting pages were extracted from the personomies of other users to evaluate the algorithm's capacity to distinguish uninteresting resources.
This is, the testing set was created using the test set from the user and an equivalent number of Web pages gathered from a different personomy in the collection. This second personomy was randomly chosen among those pre- senting no resource intersection with the current user. In other words, it is assumed that two users having no com- mon resources in their personomies do not shared interests, so that one user resources will be uninteresting to the other one. Although this is not strictly true, it can be considered as an approximation to obtain a negative set for testing. For evaluating the classifiers, the standard precision and recall summarized by F-measure as well as accuracy were used and error-bars indicate standard deviations [Baeza-Yates and Ribeiro-Neto, 1999]. 4 Content-based Classification Content is one of the main sources of information for de- termining the relevance of Web pages for users. It is as- sumed that similar contents to those previously seen by the user will be also interesting. In order to establish the rela- tive importance of content and social tags in personal Web page classification, the performance of one-class classifi- cation over textual elements obtained from documents was first evaluated so that it can be used as baseline for compar- ing the performance of tag-based classifiers. Web page texts were filtered using a standard stop-word list and the Porter stemming algorithm [Porter, 1980] was applied to the remaining terms. Figure 1 shows the results of training classifiers for identifying interesting Web pages using different textual sources such as the full text of doc- uments, the anchor text attached to hyperlinks (i.e. the vis- ible, clickable text in a hyperlink) belonging to the page and the page title. Each of these elements is extracted from pages belonging to a user personomy to learn a classifier for such user. F-measure scores achieved with different values of ν (Nu) parameter of one-class classifiers are showed in the figure. Classification using full-text obtained the best results, closely followed by the text from anchors. The title of re- 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.1 0.2 0.3 0.4 0.5 0.6 F− M ea su re ν (Nu) Full texts Anchor texts Titles Figure 1: F-measure scores using different textual sources from the content of resources for classification sources alone, however, did not result to be a good source for filtering interesting information. Naturally, the rela- tively low scores of F-measure is caused by the absence of negative information during learning. In addition, in the negative testing set might be some interesting pages, due to possible violations of the assumption that users do not shared interests if their personomies do not intersect each other, for which the prediction is correct but taken as an error. Nevertheless, text classifiers were still able of recog- nizing part of the user interests and are a valuable source for filtering a stream of incoming information (e.g. a RSS feed). 5 Social-based Classification Social tagging systems on the Web own their success to the opportunity of freely determining a set of tags for a resource without the constraint of a controlled vocabulary, lexicon or pre-defined hierarchy [Mathes, 2004]. However, the free-form nature of tagging also leads to a number of vocabulary problems. 
Among the reasons producing tag variations are [Golder and Huberman, 2006; Tonkin and Guy, 2006; Echarte et al., 2008]:
• inconsistent grouping of compound words consisting of two or more words; users often insert punctuation to separate the words, for example ancient-egypt, ancient_egypt and ancientegypt;
• use of symbols in tags: symbols such as #, -, +, /, :, _, & and ! are frequently used at the beginning of tags to cause some incidental effect, such as forcing the interface to list a tag at the top of an alphabetical listing;
• morphological variations given by the use of singular, plural or other derived forms of words, for example blog, blogs and blogging.

To prevent syntactic mismatches due to these reasons, the effect of different filtering strategies for tags was evaluated. First, the original raw tags were filtered to remove the symbols mentioned above, joining compound words at the same time. Then, the remaining tags were stemmed to their morphological roots using the Porter stemming algorithm.

In this study the overall top 10 tags associated to resources in the folksonomy, i.e. the 10 most frequent tags per resource, were evaluated as a source for classification and compared with the use of the complete set of tags assigned by users to such resources, also known as the full tagging activity of the resource. Frequency-based and binary representations of the resulting tag vectors were also considered and compared. Binary vectors were constructed to indicate the occurrence or non-occurrence of a given tag in the list of tags a Web page is annotated with. Frequency vectors indicate the number of users that employ a given tag to annotate the resource, i.e. f_ij is the frequency of usage of tag i for resource j; these vectors are normalized according to their length.

                                                       Min      Max    Average ± SD              Total
    # full-text terms                                1,997  115,585  47,138.14 ± 29,063.53   2,356,907
    # anchor text terms                                681   62,521  24,632.82 ± 16,268.12   1,231,641
    # title terms                                       31    4,581   1,806.44 ±  1,236.02      90,322
    # tags in the top 10 lists                          57    4,117   1,739.76 ±  1,097.52      86,988
    # tags in the top 10 lists after filtering symbols  57    3,882   1,686.14 ±  1,055.23      84,307
    # tags in the top 10 lists after stemming           55    3,462   1,495.68 ±    934.83      74,784
    # tags in the FTA                                  150   10,902   4,679.94 ±  3,053.78     233,997
    # tags in the FTA after filtering symbols          141    9,872   4,328.46 ±  2,791.12     216,423
    # tags in the FTA after stemming                   122    8,678   3,757.00 ±  2,426.63     187,856

Table 1: Summary of Web page statistics per personomy in the dataset used for experimentation

5.1 Results using top-10 tags

Figure 2: F-measure scores using frequency-based representations of the top 10 tags associated to resources for classification

Figure 2 depicts the F-measure scores achieved with one-class SVM classifiers learned using the top 10 lists of tags. In the figure, results are shown for raw tags as well as for tags resulting from applying the mentioned filtering strategies, first symbol removal and then stemming. Regarding the tag filtering operations, it can be deduced from these results that removing symbols and joining compound words slightly diminishes the performance of the classifiers, whereas stemming improves it. Note that the filters were applied in the previously mentioned order, so that stemming could potentially achieve better results if applied over raw tags directly.
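As an illustration of the two filters and the two vector representations just described, the following sketch is a rough approximation assuming NLTK's Porter stemmer and a hypothetical per-resource tag-count mapping; it is not the preprocessing code actually used in the experiments.

    # Sketch of the tag filters (symbol removal / compound joining and
    # Porter stemming) and of the binary vs. frequency representations.
    # The tag counts below are hypothetical.
    import re
    import numpy as np
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def filter_symbols(tag):
        # drop symbols such as #, -, +, /, :, _, & and !, joining compounds
        return re.sub(r"[#\-+/:_&!.\s]", "", tag.lower())

    # tag assignments of one resource: tag -> number of users using it
    raw_counts = {"ancient-egypt": 4, "ancient_egypt": 2, "blogs": 3, "blogging": 1}

    filtered = {}
    for tag, count in raw_counts.items():
        key = stemmer.stem(filter_symbols(tag))   # both filters, in order
        filtered[key] = filtered.get(key, 0) + count

    vocabulary = sorted(filtered)                 # feature space of the classifier
    freq = np.array([filtered[t] for t in vocabulary], dtype=float)
    freq /= np.linalg.norm(freq)                  # length-normalized frequency vector
    binary = (freq > 0).astype(float)             # binary (occurrence) vector
    print(vocabulary, freq, binary)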
In general terms, the results of using binary representations for tag vectors, which are shown in Figure 3, provide a significant improvement over normalized frequency vectors. In this representation scheme, removing symbols and joining compound words reduces the noise of tags, resulting in an improvement of the F-measure scores. However, the use of stemming does not lead to further improvements, even damaging the performance of the classifiers.

Figure 3: F-measure scores using binary representations of the top 10 tags associated to resources for classification

5.2 Results using full tagging activity
Figures 4 and 5 depict the results using the same configuration of experiments, but applied to the vectors resulting from the full tagging activity attached to resources. The F-measure scores of one-class SVM classifiers learned using frequency vectors, depicted in Figure 4, show that the filter used to remove symbols and join compound words does not significantly improve the performance of the classifiers, whereas stemming obtains slight enhancements. In turn, binary representations again outperformed frequency-based ones, and the filters attain small performance enhancements over raw tags.

Figure 4: F-measure scores using frequency-based representations of the full tagging activity for classification
Figure 5: F-measure scores using binary representations of the full tagging activity for classification

5.3 Summary of results
Figure 6 summarizes the results obtained for full-text classification of Web resources and for tag-based classification using both the top 10 tags of each resource and its full tagging activity, in their frequency-based and binary representations. It can be observed that the results of using the main source of information about the content of a resource, which is the text of the resource itself, are consistently outperformed by the use of social tags when a binary representation of tag vectors is applied. Classification based on frequency-based representations of the full tagging activity of resources also reached better performance than full-text classification. In contrast, the use of the 10 most frequent tags used to annotate resources as a means for classification exhibits inferior performance in identifying interesting Web pages for users.

Figure 6: F-measure scores considering content and social tags as sources for classification

Figure 7 shows the results obtained with the same classification sources by setting ν to 0.1, the point at which the best results were achieved, and varying the value of γ of the one-class SVM classifiers.

Figure 7: F-measure scores for variations of the γ (gamma) parameter of one-class SVM classifiers
The figure not only shows how higher values of γ lead to small increases in F-measure scores but, more importantly, that with values of γ > 0.4 any form of representation of social tags outperforms full-text classification. Furthermore, binary representations of the top 10 tags associated to resources become the best performing among the social classification schemes.

It is worth mentioning that full-text is used in these experiments as the baseline for comparison, but this source of information is not always available in social tagging systems, in which resources can be a variety of things, such as images, music, bibliographic references, etc. In these situations, classification must rely entirely on social tags. Thus, it can be concluded that the collective knowledge lying in folksonomies becomes a valuable source of information for automatic, personal classification of Web resources.

Learning classifiers using collaboratively assigned tags also impacts the dimensionality of the classification problem. In fact, tag-based classifiers extracted from the top 10 lists of tags are learned in a smaller dimensional space than full-text classifiers and yet are better predictors, as can be observed in the last results reported. Table 1 summarizes the number of unique features, terms or tags according to the case, that the classification problem has to deal with.

Figure 8 summarizes the performance of content-based and tag-based classifiers in terms of accuracy for ν = 0.1, the parameter setting leading to the best results in most experiments. Confirming the previous results, if the classifiers' capability of making correct decisions is considered, tag-based classifiers outperformed full-text ones. Also, among the tag-based classifiers, those trained using the top 10 tags assigned to each resource were the ones with superior performance. Thus, the top 10 tags offer good accuracy levels and, at the same time, an important reduction in learning and prediction complexity given the smaller size of the dimensional space.

Finally, the influence of the different sources of information for filtering Web pages, content or social tags, is analyzed according to the size of the personomies in Figure 9. For studying this aspect of classification, the 50 users were divided into five groups according to the number of resources in their personomies: users having less than 300 annotated resources, users having from 300 to 600 resources, 600 to 1000, 1000 to 2000, and more than 2000 resources.

Figure 8: Accuracy of content and social tags classifiers
Figure 9: F-measure scores obtained for different personomy sizes

Naturally, as the number of annotated resources grows, classifiers become increasingly better at filtering interesting Web pages, as more information about the user interests is available during learning. However, the difference between social tags and full-text classifiers is more noticeable in smaller personomies, in which tags outperform text by a wider margin.
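As a rough illustration of the per-user evaluation protocol behind Figures 7 to 9 (not the authors' actual experimental code), the following sketch trains a one-class SVM on part of a personomy, pads the test set with pages from a non-overlapping personomy and reports the F-measure for a given γ; the data-loading helper is hypothetical.

    # Illustrative per-user evaluation: train on 66% of a user's resources,
    # test on the remaining 34% plus an equal number of pages from a
    # personomy that shares no resources with this user.
    # load_personomy_vectors() is a hypothetical helper returning a
    # (n_resources x n_features) matrix.
    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.metrics import f1_score

    def evaluate_user(X_user, X_other, nu=0.1, gamma=0.5, train_frac=0.66, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X_user))
        split = int(train_frac * len(X_user))
        X_train, X_pos_test = X_user[idx[:split]], X_user[idx[split:]]

        # negative test examples: resources of a non-overlapping personomy
        X_neg_test = X_other[rng.choice(len(X_other), size=len(X_pos_test), replace=False)]

        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
        X_test = np.vstack([X_pos_test, X_neg_test])
        y_true = np.array([1] * len(X_pos_test) + [-1] * len(X_neg_test))
        return f1_score(y_true, clf.predict(X_test), pos_label=1)

    # sweep gamma as in Figure 7 (values here are illustrative)
    # X_user, X_other = load_personomy_vectors(...)   # hypothetical helper
    # for gamma in [0.1, 0.2, 0.4, 0.8, 1.0]:
    #     print(gamma, evaluate_user(X_user, X_other, gamma=gamma))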
6 Related Works Many works had approached the problem of tag rec- ommendation in social tagging systems [Lipczak, 2008; Symeonidis et al., 2008; Jäschke et al., 2007; Milicevic et al., 2010], however the problem of filtering resources ac- cording to the user interest had received less attention. To the best of our knowledge, no approaches have addressed this problem using one-class classification over social tags. Vatturi et al. [Vatturi et al., 2008] create a personalized ed tag-based recommender for each user consisting of two NB classifiers trained over different time-frame. One clas- sifier predicts the user current interest based on a shorter time interval and the other classifier predicts the user gen- eral interest in a bookmark considering a longer time in- terval. If any classifier predicts the bookmark as interest- ing, it is recommended. The user study results show that the tag-based recommender performs well with real data using tags from an enterprise social bookmarking system. In [Ammari and Zharkova, 2009] an approach for filter- ing the blog posts that search engines retrieve is presented. SVM is used to train and build a predictive model for the targeted user; the retrieved posts are analyzed and classi- fied by the predictive model. Finally, only the posts that are scored as relevant by the model are sent back to the user. Tag-based profiling consisting on tag vectors in which tag weights are given by their frequency of occurrence in the resources a user tagged had been proposed in [Noll and Meinel, 2007]. In [Michlmayr and Cayzer, 2007], profiles are represented by graphs in which nodes correspond to tags and edges denote relationships between them. The idea of using semantic relationships among tags in tag- based profiles has also been explored in [Huang et al., 2008]. In the work presented in this paper, one-class SVM classifiers can be seen as tag-based profiles for users. The value of the collective knowledge encapsulated in social tags for classification of resources in general direc- tories or categories was studied in several works, not from a personal perspective as in this work, but from a social point of view. Zubiaga et al. [Zubiaga et al., 2009] explore the use of Support Vector Machines (SVM) in the Social-ODP-2k9 dataset, which links Web pages and tags assigned to them in Del.icio.us with their corresponding categories in a Web directory such as the Open Directory Project (ODP)6. In this work additional resource meta-data such as notes and reviews were also evaluated besides the tagging activity. Tags in conjunction with comments achieved good results for Web page classification. Noll and Meinel [Noll and Meinel, 2008b] study and compare three different annotations provided by readers of Web documents, such as social annotations, hyperlink an- chor texts and search queries of users trying to find Web pages, for classification. Coincidentally with our finding in the context of personal Web page classification, the re- sults of this study suggest that tags seem to be better suited for classification of Web documents than anchor words or search keywords, whereas the last ones are more use- ful for information retrieval. In a further study [Noll and Meinel, 2008a], the same authors analyzed at which hierar- chy depth tag-based classifiers can predict a category using the ODP directory. It was concluded that tags may perform better for broad categorization of documents rather than for narrow categorization. 
Thus, classification of pages in cat- egories at inferior hierarchical levels might require content analysis. 7 Conclusions In this paper the role of social tags in filtering resources from folksonomies according to the interests of individual users was empirically analyzed. One-class classification was used to learn the user interests from diverse content sources (such as the full-text, anchor texts and titles) and social tagging sources (top 10 list of all tags associated to resources and their full tagging activity). Then, the extend to which each source can contribute to automatic, personal Web document classification was evaluated and compared. Experimental results obtained with a set of personomies extracted from Del.icio.us bookmarking system showed that tag-based classifiers outperformed content-based ones. Some tag filters such as removal of symbols, joint of com- pound words and reduction of morphological variants have 6http://www.dmoz.org/ 300 a discrete impact on classification performance. Interesting results were obtained using binary representations of tag vectors for learning and prediction. In this case, tag-based classifiers significantly improved the performance in filter- ing interesting results, even considering the top 10 tags as- signed to resources in a quite smaller dimensionality space. Acknowledgments This research was supported by The National Council of Scientific and Technological Research (CONICET) under grant PIP Nº 114-200901-00381. References [Ammari and Zharkova, 2009] A. N. Ammari and V. V. Zharkova. Combining tag cloud learning with SVM classification to achieve intelligent search for relevant blog articles. In Proceedings of the 1st International Workshop onMining Social Media, Sevilla, Spain, 2009. [Arakji et al., 2009] R. Arakji, R. Benbunan-Fich, and M. Koufaris. Exploring contributions of public re- sources in social bookmarking systems. Decision Sup- port Systems, 47(3):245–253, 2009. [Baeza-Yates and Ribeiro-Neto, 1999] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing, 1999. [Cavnar and Trenkle, 1994] W. Cavnar and J. Trenkle. N- gram-based text categorization. In Proceedings of 3rd Annual Symposium on Document Analysis and Informa- tion Retrieval, pages 161–175, Las Vegas, USA, 1994. [Chang and Lin, 2001] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie. ntu.edu.tw/~cjlin/libsvm. [Echarte et al., 2008] F. Echarte, J. Astrain, A. Córdoba, and J. Villadangos. Pattern matching techniques to iden- tify syntactic variations of tags in folksonomies. In Pro- ceedings of the 1st World Summit on The Knowledge Society (WSKS ’08), volume 5288 of LNCS, pages 557– 564. Springer-Verlag, 2008. [Golder and Huberman, 2006] S. Golder and B. Huber- man. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006. [Hotho et al., 2006] A. Hotho, R. Jáschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking. In The Semantic Web: Research and Applications, 3rd European Semantic Web Confer- ence, ESWC 2006, volume 4011 of LNCS, pages 411– 426. Springer, 2006. [Huang et al., 2008] Y-C. Huang, C-C. Hung, and J. Yung- Jen Hsu. You are what you tag. In AAAI Spring Sym- posium on Social Information Processing (AAAI-SIP), pages 36–41, 2008. [Jäschke et al., 2007] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommen- dations in folksonomies. 
In Knowledge Discovery in Databases: PKDD 2007, volume 4702 of LNCS, pages 506–514, 2007. [Lipczak, 2008] M. Lipczak. Tag recommendation for folksonomies oriented towards individual users. In Proceedings of ECML PKDD Discovery Challenge (RSDC08), pages 84–95, Antwerp, Belgium, 2008. [Manevitz and Yousef, 2002] L. M. Manevitz and M. Yousef. One-class SVMs for document classi- fication. Journal of Machine Learning Research, 2:139–154, 2002. [Mathes, 2004] A. Mathes. Folksonomies - cooperative classification and communication through shared meta- data. Computer Mediated Communication, 2004. [Michlmayr and Cayzer, 2007] E. Michlmayr and S. Cayzer. Learning user profiles from tagging data and leveraging them for personal(ized) information access. In Proceedings of the Workshop on Tagging and Metadata for Social Information Organization, Banff, Alberta, Canada, 2007. [Milicevic et al., 2010] A. K. Milicevic, A. Nanopoulos, and M. Ivanovic. Social tagging in recommender sys- tems: A survey of the state-of-the-art and possible ex- tensions. Artificial Intelligence Review, 33(3):187–209, 2010. [Noll and Meinel, 2007] M. G. Noll and C. Meinel. Web search personalization via social bookmarking and tag- ging. In Proceedings of 6th International Semantic Web Conference (ISWC) and 2nd Asian Semantic Web Con- ference (ASWC), volume 4825 of LNCS, pages 367–380, 2007. [Noll and Meinel, 2008a] M. G. Noll and C. Meinel. Ex- ploring social annotations for Web document classifica- tion. In Proceedings of the 2008 ACM Symposium on Applied Computing (SAC ’08), pages 2315–2320, Fort- aleza, Ceará, Brazil, 2008. [Noll and Meinel, 2008b] M. G. Noll and C. Meinel. The metadata triumvirate: Social annotations, anchor texts and search queries. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pages 640–647, Sydney, Australia, 2008. [Porter, 1980] M. Porter. An algorithm for suffix stripping program. Program, 14(3):130–137, 1980. [Schölkopf et al., 2001] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Es- timating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001. [Symeonidis et al., 2008] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. Tag recommendations based on tensor dimensionality reduction. In Proceedings of the 2008 ACM Conference on Recommender Systems (Rec- Sys ’08), pages 43–50, Lausanne, Switzerland, 2008. [Tonkin and Guy, 2006] E. Tonkin and M. Guy. Folk- sonomies: Tidying up tags? D-Lib, 12(1), 2006. [Vatturi et al., 2008] P. K. Vatturi, W. Geyer, C. Dugan, M. Muller, and B. Brownholtz. Tag-based filtering for personalized bookmark recommendations. In Pro- ceeding of the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), pages 1395– 1396, Napa Valley, California, USA, 2008. [Zubiaga et al., 2009] A. Zubiaga, R. Martínez, and V. Fresn. Getting the most out of social annotations for Web page classification. In Proceedings of the 9th ACM Symposium on Document Engineering (DocEng’ 2009), pages 74–83, Munich, Germany, 2009. 301 Abstract In this paper, we argue that Digital Object Mem- ories in the Internet of Things are closely related to partial user models in Personalization and that research can gain insights by analogy from both sides. We describe UbisMemory, a Semantic Web middleware for partial user models and dig- ital object memories. We describe the content re- presentation, the service and its technical issues. 
We argue that Digital Object Memories can be extended and merged with “Digital User Memo- ries” or life-long user models. We argue that Life-Logging for objects and for humans are closer related than expected. Introduction “Personalization and Recommendation on the Web and Beyond” is the new title of the ABIS workshop series. In this paper we are looking at the word “Beyond” and what could be meant by it. We move from the Web to the real world and return back: the so called “Internet of Things”. We look at how memories of users and memories of ob- jects differ, interact and become blurred. User modeling aspects have recently turned more and more into a broader view of “context-awareness”. This important focus on the context reveals that human-computer interaction takes place in different environments – with the immense in- crease of mobile technology, this context becomes more and more the real world itself. The desktop metaphor has been replaced by life-long user modeling in the real world. A second path to this paper comes from the other direc- tion. The concept of Digital Object Memories (DOM) has recently been introduced, see [1] or [7]. In the upcoming and fast growing research area “Internet of Things” (IoT), the “Digital Object Memories” play a central role to ena- ble its ideas, strategies and goals. Embedded systems, mostly based on Semantic Web Technologies, Context- Awareness and intelligent identification mechanisms enable the technical implementation. Intelligent, instru- mented environments are the foundation, where the digital object memories develop and grow. The dimension of storage or location of object memories is orthogonal to its options and possibilities that are enabled by the highly context aware environments. That means, at the service layer, it is not important if the digital object memory is stored directly locally with the object, or even if the iden- tification of the object is defined directly locally with the object. Important is only that each object can be identified uniquely by the service layer and that the object’s memory can be accessed somehow. In the internet everything that can be uniquely identified is called a resource. Thus we talk about memories attached to resources. Figure 1: Object Dimensions for Digital Memories of Users and Objects, in relation to the reseach areas If we abstract from technical issues like RFID, power supply, networking, sensors etc. at least two philosophical questions could arise between user models and digital object memories. a) If we look at the persuasion in artificial intelli- gence of “Materialism” and “Strong AI”, there is no difference concerning the potentiality of intel- ligence between a “dead” object and a “living” object like a human being. Thus, Digital (INANIMATE) Object Memories and Digital (LIVING) Object Memories can be considered as equivalent. The latter one has been researched since decades under the term of “user modeling”. b) A second philosophical issue might arise: do non-living objects get a personality as soon as they store their history in a DOM (of course with the help of the sensors and actuators in the intel- ligent environments)? To support the close rela- tion between User Modeling and Digital Object Memories we could have also taken an approach from the opposite direction and make the “ge- dankenexperiment” and personify dead objects as soon as they manage a memory. 
The statement User Models meet Digital Object Memories in the Internet of Things Dominikus Heckmann DFKI GmbH, Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany Dominikus.Heckmann@dfki.de 303 by Tara Bloom “Memory maketh the man – we are who we are thanks to our experiences.” What matters for this paper and the UbisMemory service is the possibility of uniform handling of Digital Memo- ries, for objects and humans. Thus the user models meet the digital object on the technical layer, however if one is able to adapt the materialism’s point of view, also on the philosophical layer. One advantage of this approach is that we can later intro- duce a uniform handling of privacy issues, like the so called onion model. Important is the handling of “Digital Memories”, inde- pendent, if they describe an object, a human, or an imma- terial resource, like an entity on the web, or even a simu- lated resource in dual reality. The developed middleware mainly looks at the manage- ment of “Digital Memories”, independently if they are carried along, distributed in the environment or centrally stored The viewpoint matters and DOMs have a kind of “object- oriented” viewpoint: nor the software-system is in the center, that handles all information and that communicates with all users, neither the intelligent environment. In the center is only the object, looking out to the world and managing its “personal” memory. In Figure 1 we try to show that user modeling is closely related to the Internet of Things, if we manage to broaden the object dimensions. Farm Scenario to look at DOM issues Imagine you want to transform a farm into an intelligent environment and you want to apply the technology of digital object memories for a use case in the rural farming production of eggs. One question that arises is WHAT is the object? For which objects is it interesting to manage DOMs? Well it could be the hen, it could be the egg that will be eaten by the consumer. It could be the product, to say the box of eggs that will be bought by the consumer. It could be the chicken run. To say, it could actually be everything. It depends on the use case, the business case or the application. Looking through the glasses of “intelli- gent environments”: everything in the context is an object of interest, to be on the save side. Looking through the glasses of “user modeling”, especially the farmer is of interest. However, the farmer’s user model could be based on all the other objects memories, if we take the context into account. Interesting is the issue of knowledge representation and knowledge exchange in the Internet of Things, in compar- ison to the knowledge representation and knowledge ex- change in Personalization. The following situation serves as basis to discuss the issue of object memories in this paper: The hen “Lilli” has layed an egg yesterday in its barn in Elm, Germany. At that time is was 25 degrees Celsius. The Egg is packed with 5 others into a box, which is put on a pallet, which is moved by farmer Bob with a tractor from the farm „Erlenhof“ to the merchants place. Which parts of this information can be sensed by which sensors, which parts should be stored where? Should this information be distributed to the different objects or re- sources (hen, egg, box, pallet, farmer, farm …) that might collect data for their Digital Object Memories? Or should the description of the situation be stored in this intelligent farm environment. 
Or does it even make sense to intro- duce a centralized storage service and broker for partial digital object memories and partial user models? The rest of this paper is arranged as follows: we discuss the issue of storage of object memories in the next section, and introduce parts of the UbisMemory system in section three, followed by a concluding discussion. Storage of Object Memories A first question that arises is where should the object’s memory be stored? Well, an obvious solution is to attach it directly to the object itself. Interestingly, for the eggs in the farm scenario is that parts of the object memory are written on the eggs directly. One can read the date, the country, even the chicken run where the egg was laid (of course not in digital form). Figure 2: Local (A), Distributed (B), versus Central (C) Storage (IE = Intelligent Environment) What could be the advantage of a central DOM reposito- ry? Well imagine the egg is further processed, like a scrambled egg. What to do if the consumer gets an egg- related disease after a couple of days? In that case It would be especially important to find out all the DOM- details like “the chicken run” that produced the egg. In that scenario, the DOM gets interesting days after the object has been destroyed. If we only rely on A), where the DOM is stored directly at the object, we have a prob- lem. An additional C) DOM @ Central would be of great help. Another argument of a centralized DOM-archive is the possibility of a uniform user interface to inspect the memories and to set privacy issues. As conclusion, all three versions A), B) and C) have their rights of existence, their advantages and disadvantages. The UbisMemory service that is described hereafter fo- cuses firstly on the centralized C) and secondly on distri- buted B). UbisMemory UbisMemory mainly addresses the issue of “Content Sharing” within ubiquitous computing, that can be com- pared to [1]. The UbisMemory middleware is integrated into the test bed architecture called UbisWorld 3.0, see [8], for research on ubiquitous user modeling in the era of 304 Semantic Web and Web 2.0 facing the Internet of Things in ubiquitous computing. The basic storage idea is founded on SituationalState- ments that form a relation-based user model & context representation, see[5]. The exchange protocol is defined as the earlier introduced UserML, which is an XML & RDF-based exchange language. Interesting is the question of what will be exchanged? Apart from the “Mainpart”, also Meta Data are exchanged. Which meta data is important for ubiquitous user model- ing? - When and where is the statement valid? - Who claims this and which explanation is given? - What is the evidence and the confidence? - What will be done with the DOM/UM? - When will this information be deleted? - Who is the owner of this information? - What are the privacy settings? - How can the statement be uniquely identified? - Can the DOM/UM entries be grouped with others? Figure 3: onion model of statements with several layers of meta-data This meta-data can be used to define filters to tailor the parts of DOMs in the case of a DOM request. Instead of describing the full UbisMemory architecture, see [8] for more details, we pick out the following three topics and describe them in more detail: privacy issues, exploitation and the user interface. UbisMemory supports a fine grained semantic Location Modelling for DOMs. It is based on HUGO, the “Huge Ontology” with currently over 20 Million semantically described places. 
HUGO, see [7], is based on Linked Data, enriched with an Web 2.0 approach to add collabo- ratively new locations. It ranges from countries, cities, streets to individual houses, rooms and even shelves in furniture. Especially for detailed DOMs, a fine-grained location model is needed that goes beyond the GPS coor- dinates or the position with two dimensional coordinates, since interaction often takes place inside a house or a production place. DOM Exploitation To consider systems that integrate adaptation as exploita- tion of DOMs and the context is introduced in [9]. In [7] it is shown that complex applications can still be realized with cheap and weakly instrumented smart objects. This is done by establishing digital object memories. How can the UbisMemory be integrated into the full process loop of enabling smarter and more intelligent applications on the basis of digital object memories? Well, in order to exploit the Digital Object Memories, a full process loop of the following four steps have to be implemented. 1) Instrumentation (i.e of the farm, or the farmer) 2) Interpretation (inference from sensor data to memory) 3) Communication (exchange of DOMs) 4) Adaptation (exploitation of DOMs & context) Some techniques are discussed in [4] on how the interpre- tation from Sensor Data to Memories via LOGs and JOURNALS can be done. UbisMemory only applies to the third step of “communi- cation”. In this sense it serves as a broker of DOMs. They can be stored there & retrieved from there. Together with a privacy handling mechanism, a broker service has been established. Privacy Issues in UbisMemory The exploitation of the resulting memories need to be regulated. In the farm scenario: For example, who should be allowed to read the information on what fodder was given to the hen that layed the egg? We can support fine-grained privacy handling, since the SituatinalStatements apply the onion model which means that the mainpart information is only available if the pri- vacy filter has been passed. A detailed description can be found in [6]. UbisMemory’s User Interface We have developed a general user interface to inspect the digital object memories as well as partial life-long user models, (as discussed in section one). Figure 4 shows a screenshot. The user interface can be accessed 1 via http://ubisworld.org/statementstore/. Figure 4: ubis.StatementStore, the user inteface of UbisMemory to inspect and control the Digital Memo- ries of Objects and Users. 1 however certain access rights need to be requested first. 305 The key idea is to decouple every situational statement into its semantic roles and allow for each of them a filter- ing by selection. By this approach you can select every object, and every attribute value pair that describes the memory item in mind, together with its meta-level infor- mation like timestamp, location, and privacy information like the owner of this bit of information. CONCLUSION In this paper we argue that user models will eventually meet digital object memories in the internet of things. In [2] we have stated that: a comprehensive log of the user’s behavior together with corresponding context descriptions allows adaptive systems to learn about users, to identify their habits, and to improve the quality of user support. In addition, users can apply such knowledge to learn from others and about themselves. 
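To make the preceding description more concrete, the following sketch shows one possible in-memory representation of a situational statement with onion-model meta-data layers and a simple privacy filter. The field names and the access policy are illustrative assumptions derived from the questions listed above; the actual UserML/RDF schema is defined in [5] and is not reproduced here.

    # Illustrative sketch (not the actual UserML/RDF schema) of a
    # situational statement with onion-model meta-data layers and a
    # simple privacy filter, loosely following the farm scenario.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SituationalStatement:
        # mainpart: subject (object or user), attribute, value
        subject: str                      # e.g. "egg-4711" or "farmer-bob"
        attribute: str                    # e.g. "laidAt" or "temperature"
        value: str                        # e.g. "barn-Elm" or "25 Celsius"
        # situation layer: when and where the statement is valid
        timestamp: Optional[str] = None
        location: Optional[str] = None
        # explanation layer: who claims it and with what confidence
        creator: Optional[str] = None
        confidence: float = 1.0
        # privacy layer: owner and access policy (outermost onion layer)
        owner: Optional[str] = None
        access: str = "private"           # e.g. "private" or "public"

    def readable_by(stmt: SituationalStatement, requester: str) -> bool:
        # the mainpart is only released if the privacy filter is passed
        return stmt.access == "public" or requester == stmt.owner

    stmt = SituationalStatement("egg-4711", "laidAt", "barn-Elm",
                                timestamp="2010-10-03", location="Erlenhof",
                                creator="sensor-17", owner="farmer-bob")
    print(readable_by(stmt, "merchant"))    # False: mainpart stays hidden
    print(readable_by(stmt, "farmer-bob"))  # True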
Now, how much more inter- esting and fruitful can it be if we log the objects beha- vior’s, situations and contexts with the overall approach of Digital Object Memories become together with the Inter- net of Things? We expect a better quality of object sup- port and a better quality of the objects (or products) them- selves. For example, if we transfer the onion model of statements from user model research to digital object memories. Highly interesting are social implications and philosophical issues if we assume that objects gain a cer- tain level of personality by their new memories. Instead of looking at Digital Object Memories only, we looked at “digital memories of objects”, while we extend the view of “object” to the three dimensions physical-digital, real- virtual, and inanimate-living. With this generalization, we are able to combine User Modeling and Context- Awareness even in Dual Reality scenarios under the inte- grating concept of “Digital Memories of Objects”. Thus, classical Digital Object Memories and Life-Long User Models can be treated in our approach with the same middleware. In this paper we tried to contribute to the four following topics: Memory Representation: which we realize with SituationalStatements as introduced in [5]. Memory Ar- chitectures: We pointed to UbisMemory, a middleware approach which also allows for the realization of object memory functionality. This includes infrastructures for the centralized or distributed organizing, storing, and broker- ing of object‐related information, however based on a remote infrastructure, not on the object itself. Privacy Aspects: Who “owns” the data stored in an object’s mem- ory, who can access/delete/correct it? How long must/should memory content be stored, and can trust be established for the object memory? Human Memory Access: This topic comprises technologies and concepts to make an object memory’s content accessible to human users. With the UbisMemory browser, called State- mentStore, we try to structure and relate the wide variety of diverse data that might be contained in the memory due to its open nature. The presented middleware architecture is integrated into an interesting test bed for research on ubiquitous user modeling in the era of Semantic Web and Web 2.0 facing the Internet of Things in ubiquitous com- puting. This UbisWorld 3.0 tool set can be tested online at www.ubisworld.org. What is “beyond” the Web? Well, the real world! However, the real world is currently in the process to become part of the Web by the movement of the Internet of Things. Interesting is to see the upcoming research of Personalization together with the Internet of “all” Things, real or virtual, alive and inanimate. ACKNOWLEDGMENTS This work was partially supported by the European 7th Framework Program project GRAPPLE (“Generic Res- ponsive Adaptive Personalized Learning Environment”): http://www.grapple-project.org. REFERENCES [1] Kröner, A.; Schneider, M.; Mori. J. 2009. A Frame- work for Ubiquitous Content Sharing. IEEE Perva- sive Computing, Vol. 8, No. 4, pp. 58-65. IEEE Press. [2] Wahlster, W., Kröner A., Schneider, M., and Baus, J. 2008. Sharing Memories of Smart Products and Their Consumers in Instrumented Environments. In it – In- formation Technology, Vol. 50(1), Special Issue on Ambient Intelligence, pp. 45-50, Oldenburg [3] Heckmann,D. , Schwartz,T., B. Brandherm, M. Schmitz, and M. von Wilamowitz- Moellendorff. GUMO - The General User Model Ontology. In Proc. of Int. Conf. 
on User Modeling, Edinburgh, UK, 428– 432, 2005. [4] Kröner, A, Heckmann, D, & Wahlster, W., 2006. SPECTER: Building, Exploiting, and Sharing Aug- mented Memories. In K. Kogure (Ed.), Workshop on Knowledge Sharing for Everyday Life 2006 (KSEL06) (pp. 9-16). Kyoto, Japan. [5] Heckmann, D. (2003b). Introducing situational state- ments as an integrating data structure for user model- ing, context-awareness and resource- adaptive com- puting. In ABIS2003, pages 283-286, Karlsruhe, Germany. [6] Heckmann, D. (2003a). Integrating Privacy Aspects into Ubiquitous Computing: A Basic User Interface for Personalization. In Krüger, A. and Malaka, R., editors, Artificial Intelligence in Mobile Systems (AIMS 2003), pages 106-110, Seattle, USA. in con- junction with the Fifth International Conference on Ubiquitous Computing. [7] Schneider, M., Kröner, A. "The Smart Pizza Packing: An Application of Object Memories," Proc. 4th Int'l Conf. Intelligent Environments (IE 08), Inst. Eng. and Technology, 2008. [8] Heckmann, D. Matthias Loskyll, Rafael Math, Pascal Recktenwald, Christoph Stahl. UbisWorld 3.0: a Se- mantic Tool Set for Ubiquitous User Modeling, Demonstration description in online proceedings of First International Conference on User Modeling, Adaptation, and Personalization (UMAP 2009) [9] Plate C. et al., "Recomindation: New Functions for Augmented Memories," Proc. Adaptive Hypermedia and Adaptive Web-Based Systems (AH 06), Springer, 2006, pp. 141–150. 306 Abstract Users return to Web pages for various reasons. Apart from pages visited due to backtracking, users typically monitor a number of favorite pag- es, while dealing with tasks that reoccur on an in- frequent basis. In this paper, we introduce a nov- el method for predicting the next revisited page in a certain user context that, unlike existing me- thods, doesn’t rely on machine learning algo- rithms. We evaluate it over a large data set com- prising the navigational activity of 25 users over a period of 6 months. The outcomes suggest a significant improvement over methods typically used in this context, thus paving the way for ex- ploring new means of improving user’s naviga- tional support. 1 Introduction Nowadays, millions of people browse the Web every second, navigating from site to site and producing massive amounts of navigational log data. These data have the intrinsic potential to provide a solid basis for understand- ing individual user’s behavior. Modeling users is the first step towards this direction, serving as a foundation for developing recommendation and prediction techniques. Many applications can benefit from effective methods of user modeling, like Web search and personaliza- tion/recommendation systems, to name but a few. For example, predictive models have improved the ranking of web search engine results, by computing the distribution of visits over all WWW pages and using it for re- weighting and re-ranking relevant web pages. Navigation- al information are actually considered more important than text keywords. Hence, the more accurate the predic- tive models, the better the search results [Brin and Page, 1998]. Several researchers have undertaken the task of under- standing user’s surfing behavior, by exploiting user data [Adar et al., 2008; Obendorf et al., 2007]. 
Some go fur- ther, using log data to improve algorithms that predict future requests [Awad et al., 2008 ; Gery and Haddad, 2003], while others apply these algorithms to provide users with improved tools for recommendations, book- marks and history [LeeTiernan, 2003; Pedersen et al., 2010]. As a common practice of studying the past in order to define the future, in this paper we analyze the browser’s log data of 25 users along 6 months with a total of 137,737 page requests. Our detailed, statistical analysis of the navigational data gives insights for our research as well as for future work in the area. In addition, we demon- strate a novel user modeling method for predicting the next page that will be revisited. We tested our model on the data set at hand, with the experimental results demon- strating a significant improvement in the support of Web page revisitation over existing methods, commonly used in this area. 2 Related Work Several past works have explored surfing behaviors with respect to revisitation activity. Although they vary in their estimations, they all recognize that revisitation constitutes a large part of the Web activity. [Herder, 2005], for in- stance, quantifies it to 50% of the overall Web traffic, while [Cockburn and McKenzie, 2001] approximates it to 80%. They also noted that bookmarks, the most popular revisitation supporting tool, invariably involve managing and organizational problems due to the constantly increas- ing size of their collections. Analysis of revisitation. [Tauscher and Greenberg, 1997] describes two important characteristics of revisita- tion: first, most page revisits pertain to pages accessed very recently; the probability for a page to be revisited decreases steeply with the number of page visits since the last visit. Second, there is a small number of highly popu- lar pages that are visited very frequently; the probability for a page to be revisited decreases steeply with its popu- larity ranking. Revisitation behavior has been distinguished by [Oben- dorf et al., 2007] into three different sets: short-term (i.e., backtrack or undo within the same session), medium-term (i.e., re-utilize or observe a resource in a period of time up to few weeks after the first encounter), and long-term revisits (i.e., rediscover a resource several months after the first encounter). The authors further argue that the back button is the most commonly used tool for short- term revisit. For medium-term revisits, the page address is directly typed into the address bar, making use of the automatic URL completion function. However, revisits to a broad range of pages that are accessed on a less frequent basis (i.e., long-term revisits) are poorly supported; users often do not remember the exact address, and ironically browsers do not ‘remember’ the address either. [Adar et al., 2008] demonstrates that short-term revisits involve hub-and-spoke navigation, visiting shopping or reference sites or pages on which information was moni- tored. Medium-term revisits pertain to popular home pag- es, Web mail, forums, educational pages and the browser homepages. As for long term revisits, they involve the use How Predictable Are You? A Comparison of Prediction Algorithms for Web Page Revisitation Ricardo Kawase, George Papadakis, Eelco Herder L3S Research Center Hannover, Germany {kawase, papadakis, herder}@L3S.de 307 of search engines, as well as weekend activities (e.g., going to the cinema). Revisitation Prediction. 
In [Gery and Haddad, 2003], the authors exploit three methods of Web usage mining: association rules, frequent sequences, and frequent gene- ralized sequence. Association Rules (AR) are well docu- mented in the literature as a method that effectively identi- fies pages that are typically visited together in a same session, but not necessarily in the same order. [Agrawal et al., 1993; Agrawal and Srikant, 1995]. Frequent Sequence Mining can be considered as equivalent to association rule mining over temporal data sets, while Frequent Genera- lized Sequence introduces sequences that allow wildcards, constituting a more flexible means of modeling users’s navigational activity [Gaul and Schmidt-Thieme, 2000]. Their evaluation shows that plain Frequent Sequence Mining performs better in revisitation prediction. Howev- er, their dataset consists of server side logs of 3 different websites, thus covering a limited number of possible revi- sited pages. Contrariwise, in our work we employ brows- ers’s log data to analyze and predict users’s, with the set of revisited web pages potentially involving the whole Web. In [Awad et al., 2008], the authors apply two well- established classification techniques in the context of Web surfing prediction: Markov model and Support Vector Machines (SVM). They also combine them in a hybrid method under Dempster’s rule and the outcomes of their evaluation suggest that it outperforms the individual me- thods, especially when domain knowledge is incorporated into it. 3 Data Set and Revisitation Statistics In this section we briefly introduce the data set that we used for our experiments. We also illuminate the most important aspects of users’s revisit behavior – general characteristics as well as individual differences – which are used as a basis for the predictive methods that are evaluated in this paper. 3.1 Data set The participant pool of our data set consists of 25 partici- pants, 19 male and 6 female. Their average age is 30.5, ranging from 24 to 52 years. The participants were logged for some period between August, 2004 and March, 2005. The average time span of the actual logging periods was 104 days, with a minimum of 51 days and a maximum of 195 days. Participants were logged in their usual contexts - 17 at their workplace, 4 both at home and at work, and 4 just at home. During the logging period, 152,737 page requests were recorded. 10.1% of them were removed, as they were artifacts (advertisements, reloads, redirects, frame sets). Hence, in total we have 137,737 page requests available for analysis. 3.2 Revisitation Statistics We recorded an average revisitation rate of 45.6%. Note that this number is lower than in earlier studies, due to the fact that we took into account both GET and POST para- meters. The wide range of individual revisitation range (between 17.4% and 61.4%) suggests that revisitation behavior is heavily influenced by personal habits, private interests and the sites visited (for more details, [Obendorf, 2008]). In this section we concentrate on individual ferences between users in their revisitation profile. As discussed in Section 2, several studies have identi- fied regularities in revisitation behavior. Users typically have a small set of frequently visited pages, including for example the browser’s home page, search engines, favo- rite news sites, and social networking sites. 
As can be observed in Figure 1, the distribution of the most frequently used pages clearly follows a power law for most of the users, but not for all – some have a large number of pages in their browsing routine.

The distribution of revisits to pages, based on the number of pages between the last visit and the current visit, does follow a power law distribution for all users. Consequently, backtracking activities (revisits to pages in the current session) and routine behavior (revisits to pages in previous sessions) grow roughly linearly with the revisit rate. This is illustrated in Figure 2 – note that despite the correlation there are still users that can be identified as predominantly backtrackers or predominantly routine revisitors. The average percentage of backtracking actions among revisits is 75%, with a minimum of 51% and a maximum of 84%.

Figure 1: Distribution of most frequently visited pages for each user.
Figure 2: Backtracking and routine behavior plotted against the revisit rate (ordered by revisit rate).
Figure 3: Repetitive behavior (% repeated actions) plotted against the revisit rate.

Predictive models of Web navigation, such as Markov models, typically assume that users exhibit a rather large percentage of repetitive behavior, including sequences of pages that are regularly visited in the same order. In Figure 3 we plot the users' repetitive behavior (based on the ratio between the number of unique pairs of pages that a user visited consecutively and the total number of transitions). The average percentage of repetitive transitions is 20%, with a minimum of 5% and a maximum of 60%.

3.3 Discussion

Based on the statistics in the previous section, it becomes clear that Web page revisitation behavior follows sufficient regularities to be exploited for enhanced revisitation support – in a similar manner as is already common in recommender systems based on Web usage mining and collaborative filtering. Earlier work on revisitation (see Section 2) confirms this observation, but only to a limited extent.

In the following, we investigate, compare and combine the performance of several predictive methods for page revisitation. Our analysis attempts a general comparison of several prediction mechanisms, with the aim of identifying the best performing one, knowing though that their performance depends heavily on the regularities in the individual user's revisitation activities.

4 Prediction of Next Page Visits

The problem we are tackling in this paper can be formally defined as follows:

Given a collection of Web pages, P = {p_1, p_2, ...}, that have been visited by a user u during his past n transactions, T_u = {t_1, t_2, ..., t_n}, rank them so that the ranking position of the page revisited in the next transaction, n + 1, is the highest possible.

The methods coping with this problem should exclusively try to facilitate the revisitation of already accessed pages, rather than trying to suggest to a user new pages that seem relevant to his surfing activity. The ranking of all web pages is updated after every transaction, and the higher the ranking position of the subsequently accessed page, the better. In other words, the lower the Average Ranking Position (ARP) of revisited pages, the higher the performance of the algorithm. This is in line with the intuition behind the ranking of a search engine's query results: the higher the ranking of the desired resource, the better the performance of the search engine [Brin and Page, 1998].
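To make the task concrete, the following minimal Python sketch shows how a prediction method is exercised over a user's transaction stream: after every transaction the ranking is updated, and whenever the next access is a revisit, the position of that page in the current ranking is recorded. The `ranker` object with `rank(page)` and `update(page)` methods is a hypothetical interface chosen for illustration, not something defined in the paper.

```python
def replay(transactions, ranker):
    """Replay a user's page requests against a ranker and record,
    for every revisit, the rank the ranker currently assigns to that page."""
    seen = set()
    positions = []            # ranking position of each revisited page
    for page in transactions:
        if page in seen:      # this access is a revisitation
            positions.append(ranker.rank(page))
        ranker.update(page)   # re-rank after every transaction
        seen.add(page)
    return positions

# Example (SomeRanker is a placeholder for any method from Section 4):
# positions = replay(["a", "b", "a", "c", "b"], SomeRanker())
```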
To solve the aforementioned problem, we employ a framework combining two categories of methods. The first one involves ranking methods: they estimate for each web page the likelihood that it will be accessed in the next transaction based on some evidence, such as the recency or the frequency of earlier visits to this page. The second category covers propagation methods; these are techniques that capture repetitiveness in the surfing behavior of a user and identify groups of pages that are typically visited together, in the same session but not necessarily in a specific order. In the following, we provide a brief outline of our framework that brings these two categories of prediction methods together. The implementation of the methods presented here is publicly available through the SUPRA project on SourceForge.net (http://sourceforge.net/projects/supraproject/).

4.1 Ranking Methods

The aim of ranking methods is to provide for each web page a numerical estimate of the likelihood that it will be accessed in the next transaction. In this work, we consider the following ranking methods:

1. Least Recently Used (LRU)
2. Most Frequently Used (MFU)
3. Polynomial Decay (PD)

The first two methods, LRU and MFU, constitute well-established caching algorithms that are typically employed in prediction tasks. LRU is based on the idea that the more recently a web page was visited, the more likely it is to be revisited in the immediate future. Hence, it assigns the highest ranking position to the most recently accessed page. MFU, on the other hand, relies on the idea that the more often a web page is visited, the more likely it is to be revisited in the next transaction. [Papadakis et al., 2010] demonstrated, though, that these methods are not adequate for effectively predicting future revisitations on server-side logs of closed-corpus websites. Due to their unidimensionality, LRU produces a plainly chronological arrangement of web pages based on their recency, while MFU takes into account merely their degree of usage. More accurate predictions can be achieved when incorporating both types of evidence into a single, comprehensive method.

To this end, [Papadakis et al., 2010] introduced the decay ranking model for predicting the next revisited page. According to this model, the value v_i^n of a web page w_i after the n transactions T_u of user u is derived from the following formula:

v_i^n = \sum_{k=1}^{n} d(t_k, w_i, n),

where d(t_k, w_i, n) is a decay function that takes as input the k-th transaction t_k of user u, together with the index of the current transaction, n, and gives as output the value of this transaction for web page w_i. Every valid decay function should satisfy the following properties ([Cormode et al., 2009]):

1. d(t_k, w_i, n) = 1 when k = n,
2. d(t_k, w_i, n) = 0 if t_k does not pertain to web page w_i,
3. 0 \leq d(t_k, w_i, n) \leq 1 for all 0 \leq k \leq n,
4. d is monotone non-increasing as n increases (0 \leq k \leq n): n' \geq n \Rightarrow d(t_k, w_i, n') \leq d(t_k, w_i, n).

Among the various decay function families that satisfy these properties, the polynomial decay functions were found to outperform both the exponential and the logarithmic ones. The reason is that their smooth decay harmonically balances the recency and the degree of usage of web pages; in contrast, exponential functions convey a steep decay that puts more emphasis on the recency of usage, whereas logarithmic functions promote the degree of usage excessively, due to their very slow decay.
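The three ranking methods can be sketched as scores over a user's visit history, where a higher score means a higher ranking position. The decay-based value follows the formula above; the concrete decay used here is a polynomial form d(t_k, w_i, n) = (n − k + 1)^{−α} for transactions that pertain to the page, which is an assumption consistent with the stated properties and the polynomial decay defined just below, not necessarily the authors' exact parameterization.

```python
from collections import Counter, defaultdict

def lru_scores(history):
    """LRU: the more recently a page was visited, the higher its score."""
    return {page: k for k, page in enumerate(history)}   # last index wins

def mfu_scores(history):
    """MFU: the more often a page was visited, the higher its score."""
    return dict(Counter(history))

def decay_scores(history, alpha=1.0):
    """Decay model: v_i^n = sum_k d(t_k, w_i, n), with a polynomial decay
    d = (n - k + 1) ** -alpha assumed for transactions of the page."""
    n = len(history)
    scores = defaultdict(float)
    for k, page in enumerate(history, start=1):
        scores[page] += (n - k + 1) ** -alpha
    return dict(scores)

def ranking(scores):
    """Pages ordered from most to least likely to be revisited next."""
    return sorted(scores, key=scores.get, reverse=True)
```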
The form of a polynomial decay function with exponent \alpha is the following:

d(t_k, w_i, n) = b \cdot (n - k + 1)^{-\alpha},

where b is equal to 1 if t_k pertains to w_i, and 0 otherwise.

4.2 Propagation Methods

Unlike ranking methods, which produce an ordering of web pages, propagation methods aim at detecting and capturing patterns in the surfing activity of users. They identify those pages that are commonly visited within the same session and associate them with each other. The "links" created by these methods can be combined with a ranking method, so that the value of a web page is propagated to its related pages. In this way, the higher the value of a web page, the more the pages associated with it are boosted and the higher their ranking position.

In this work, we distinguish between two families of propagation methods: those that take into account the order of the transactions within a session, and those that disregard this order. For the former case, we consider transition matrices, whereas for the latter we examine association matrices.

4.2.1 Transition Matrix

Similar to a first-order Markov model, a transition matrix (TM) is a two-dimensional structure whose rows and columns represent the enumeration of web pages; each cell TM(x, y) expresses the number of times that a user visited page y after x. Given that a transition matrix respects the order of accesses within a session, it is not symmetrical: the value of TM(x, y) is not necessarily equal to that of TM(y, x). Moreover, its diagonal cells are all equal to 0: TM(x, x) = 0 for all x.

In the following, we introduce 4 different approaches to correlating web pages according to the past navigational activity in order to build the transition matrix. They can be intuitively illustrated through a simple walkthrough example. Given a set of 4 web pages – A, B, C, D – and the session A → B → C → D → A, we can associate these pages in four ways (taking into account the order of the accesses):

1. Simple connectivity – For each transition x → y in the given session, only the value of the cell TM(x, y) is incremented by one. Figure 4a) depicts the values of the transition matrix according to the simple connectivity rule after the last transition of the given session, D → A.
2. Continuous connectivity – Each web page visited within the current session is associated with all the subsequently accessed pages. In our example, after transition D → A, A is associated with all other web pages (B, C, D), incrementing the corresponding cells by one, as shown in Figure 4b).
3. Decreasing continuous connectivity – This strategy operates in a similar way as the previous one (i.e., connecting all the pages within a session), with the difference that it adds a decay parameter representing the distance (i.e., the number of transitions) between two web pages. In our example, cell (C, A) is incremented by ½ after D → A, since page C is two steps away from page A. Figure 4c) depicts the values of the transition matrix according to the decreasing continuous connectivity rule after the transition D → A.
4. Increasing continuous connectivity – This is the inverted version of the previous strategy. Instead of decreasing the value added to cell TM(x, y) according to the distance of pages x and y, it increases it proportionally. The outcomes of this rule after transition D → A are presented in Figure 4d).
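The four ways of filling the transition matrix can be written as a single update routine applied per session, as sketched below. The sketch assumes that the "decreasing" rule halves the added value with every additional step and the "increasing" rule doubles it, which is consistent with the ½, ¼ and 2, 4 entries of the walkthrough example; the paper does not spell out the exact decay parameter, so this is an illustrative choice.

```python
from collections import defaultdict

def update_transition_matrix(tm, session, rule="simple"):
    """Add one session (ordered list of pages) to a transition matrix
    tm[(x, y)] according to one of the four connectivity rules."""
    for j in range(1, len(session)):
        y = session[j]
        if rule == "simple":
            x = session[j - 1]
            if x != y:
                tm[(x, y)] += 1
            continue
        for i in range(j):                  # connect all earlier pages to y
            x, distance = session[i], j - i
            if x == y:
                continue                    # diagonal cells stay zero
            if rule == "continuous":
                tm[(x, y)] += 1
            elif rule == "decreasing":
                tm[(x, y)] += 0.5 ** (distance - 1)
            elif rule == "increasing":
                tm[(x, y)] += 2 ** (distance - 1)
    return tm

# Walkthrough from the text: session A -> B -> C -> D -> A
tm = update_transition_matrix(defaultdict(float), list("ABCDA"), "decreasing")
# tm[("C", "A")] == 0.5 and tm[("B", "A")] == 0.25, as in Figure 4c)
```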
It is worth noting that the simple connectivity transition matrix was also used in [Awad et al., 2008], but there its frequencies were used as features of a classification algorithm instead.

Figure 4: Transition matrix example after the session A → B → C → D → A (rows: source page, columns: target page, in the order A, B, C, D).
(a) Simple connectivity:
A: 0 1 0 0
B: 0 0 1 0
C: 0 0 0 1
D: 1 0 0 0
(b) Continuous connectivity:
A: 0 1 1 1
B: 1 0 1 1
C: 1 0 0 1
D: 1 0 0 0
(c) Decreasing continuous connectivity:
A: 0 1 ½ ¼
B: ¼ 0 1 ½
C: ½ 0 0 1
D: 1 0 0 0
(d) Increasing continuous connectivity:
A: 0 1 2 4
B: 4 0 1 2
C: 2 0 0 1
D: 1 0 0 0

4.2.2 Association Matrix

In contrast to transition matrices, association matrices (AM) are based on the idea that the temporal order of transactions within a session is not important; pages that are visited in the course of the same session should be equally connected with each other, regardless of their order and the number of transitions that intervene between them. The rationale behind this idea is that users may visit a group of pages X, Y, Z on a regular basis, but not necessarily in that order.

In this context, an association matrix is built simply by associating all the pages that are visited in a single session. Given the session presented above, the resulting AM has all non-diagonal cells equal to one, as all resources were accessed during this session (Figure 5).

Figure 5: Association matrix example for the session A → B → C → D → A (rows and columns in the order A, B, C, D):
A: 0 1 1 1
B: 1 0 1 1
C: 1 1 0 1
D: 1 1 1 0

A variation of the association matrix can be derived by normalizing its values with the help of mutual information. More specifically, this involves the multiplication of each cell AM(x, y) of AM with the following mutual information factor (mif):

mif(x, y) = AM(x, y) \cdot \log \frac{p(x, y)}{p(x) \cdot p(y)},

where
• AM(x, y) is the number of sessions containing both page x and page y (i.e., the value of the cell AM(x, y)),
• p(x, y) is the probability of a session containing pages x and y (i.e., the value of AM(x, y) divided by the number of sessions), and
• p(x) (respectively p(y)) is the probability that a session contains page x (respectively y), i.e., the number of sessions with x (or y) divided by the total number of sessions.

Without this smoothing factor, the values of AM are biased towards pairs of pages that have a high frequency of co-occurrences, although they are not highly correlated.

4.3 Combining Ranking with Propagation Methods

To combine the available ranking methods with the variations of the propagation techniques, we employ a simple, linear scheme: following a transaction, the value of each web page is first (re)computed according to the selected ranking method. Then, for each non-zero cell of the transition matrix at hand, TM(x, y), we increase the value of page y, v_y, as follows:

v_y \leftarrow v_y + p(x \rightarrow y) \cdot v_x,

where
• p(x → y) is the transition probability from page x to page y, estimated by p(x \rightarrow y) = TM(x, y) / \sum_{z} TM(x, z), and
• v_x is the value of x estimated by the ranking method.

In case an association matrix is used as a propagation method, v_y is increased as follows:
1. v_y \leftarrow v_y + AM(x, y) \cdot v_x for the plain association matrix, or
2. v_y \leftarrow v_y + mif(x, y) \cdot v_x for the mutual-information-normalized association matrix.

All in all, considering the 3 ranking methods alone and in combination with the 4 variations of TM and the 2 variations of AM, we have 21 distinct ranking methods. Due to space limitations and for the sake of readability, the following evaluation section focuses merely on the best performing ones.
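The linear combination of Section 4.3 can be sketched as a propagation pass over the ranking values. The snippet below assumes that the values `v` come from one of the ranking sketches above and that `tm` / `am` are dictionaries keyed by page pairs, as in the earlier sketches; the mutual-information factor follows the formula given for the normalized association matrix. This is an illustrative sketch, not the authors' implementation.

```python
import math
from collections import defaultdict

def propagate_tm(v, tm):
    """v_y += p(x -> y) * v_x, with p(x -> y) = TM(x, y) / sum_z TM(x, z)."""
    row_sums = defaultdict(float)
    for (x, _), w in tm.items():
        row_sums[x] += w
    boosted = dict(v)
    for (x, y), w in tm.items():
        if x in v and row_sums[x] > 0:
            boosted[y] = boosted.get(y, 0.0) + (w / row_sums[x]) * v[x]
    return boosted

def mif(am, x, y, n_sessions, sessions_with):
    """Mutual information factor; assumes AM(x, y) > 0 for the pair."""
    p_xy = am[(x, y)] / n_sessions
    p_x = sessions_with[x] / n_sessions
    p_y = sessions_with[y] / n_sessions
    return am[(x, y)] * math.log(p_xy / (p_x * p_y))

def propagate_am(v, am, weight=None):
    """v_y += AM(x, y) * v_x; pass a mif-based weight function for the
    mutual-information-normalized variant."""
    boosted = dict(v)
    for (x, y), w in am.items():
        if x in v:
            factor = w if weight is None else weight(x, y)
            boosted[y] = boosted.get(y, 0.0) + factor * v[x]
    return boosted
```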
5 Evaluation Setup and Discussion of Results

5.1 Setup

To evaluate our framework of methods experimentally, we employed the data set described in Section 3, comprising 25 distinct users and 137,737 page requests in total (not evenly distributed among the users). In more detail, we simulated the navigational activity of each user, independently of the others. After each transaction, the ranking of all visited pages was updated, and, in case the next access was a revisitation, the position of the corresponding web resource was recorded. Having recorded these ranking positions for each recommendation method, we derived the following metrics for evaluating its performance:

• Precision at 10 (P@10): it expresses the percentage of revisitations that involved a web page ranked in one of the top 10 positions. The higher this percentage, the better the performance of the recommendation method. This metric provides evidence for the usability of the prediction method, as users typically have a look only at the first 10 pages presented to them (just like they do with web search engine results).

• Average Ranking Position Reduction Ratio (RR): it denotes the degree of improvement conveyed by the prediction method in comparison with the actual revisitation behavior of the user. More specifically, it is computed from the following formula:

RR = \frac{AcARP - PrARP}{AcARP} \cdot 100\%,

where
– AcARP is the Actual Average Ranking Position of the user, representing the average distance in terms of the number of page requests that intervene between the revisited web pages, and
– PrARP is the Prediction Average Ranking Position, expressing the place a revisited page is found on average in the ranking list that the prediction method produces.

The higher the value of RR, the better the performance of the recommendation algorithm, with negative values denoting that PrARP exceeds AcARP (i.e., no improvement with respect to the actual revisitation behavior of the user). RR thus provides an estimation of the overall performance of a prediction method, since it considers the performance over all the revisitations in the navigational history of a user, and not only the highest ranked ones.

On the whole, the combination of these two metrics provides a comprehensive estimation of the effectiveness of a recommendation algorithm in predicting the next revisited page; they cover both the recommendations that are indeed useful for users and the performance over all cases.
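Both metrics can be computed directly from the recorded ranking positions (for instance, as gathered by the replay loop sketched in Section 4). The AcARP computation below interprets the "distance in page requests" as the number of requests since the previous visit to the same page; this is a sketch under that assumption, not the authors' exact evaluation code.

```python
def precision_at_10(predicted_positions):
    """Share of revisits whose page was ranked in the top 10
    (positions are assumed to be 1-based)."""
    hits = sum(1 for p in predicted_positions if p <= 10)
    return hits / len(predicted_positions)

def actual_arp(transactions):
    """AcARP: average distance, in page requests, between a revisited
    page and its previous occurrence in the stream."""
    last_seen, distances = {}, []
    for i, page in enumerate(transactions):
        if page in last_seen:
            distances.append(i - last_seen[page])
        last_seen[page] = i
    return sum(distances) / len(distances)

def reduction_ratio(ac_arp, pr_arp):
    """RR = (AcARP - PrARP) / AcARP * 100%."""
    return (ac_arp - pr_arp) / ac_arp * 100.0
```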
5.2 Results analysis

The performance of the ranking algorithms we consider is summarized in Figure 6, with the RR depicted in Figure 6a) and the P@10 in Figure 6c). It is evident that the baseline MFU performs much worse than the other methods. This is explained by the fact that backtracking (LRU) is more common than revisiting popular sites, thus ensuring much higher performance for LRU. Our proposed method, Polynomial Decay, which is a combination of MFU and LRU, exhibits the best performance for all users, improving in each case on LRU to a varying but considerable extent.

The performance of PD is significantly enhanced when combined with AM and TM, with TM accounting for a higher improvement. This is the case with respect to both metrics, as is clearly depicted in Figure 6b) for RR and Figure 6d) for P@10. Conversely, the combination of LRU and MFU with AM and TM results in lower performance for both metrics (which is why their performance is not included in the figures). This suggests that users do not have many regular patterns in their page visit behavior (i.e., after having visited page X they do not always visit page Y). It is interesting to note, though, that PD achieves by far the best results in combination with the Simple TM, while LRU and MFU are better combined with the Increasing and the Decreasing TM, respectively.

Another observation is that, despite the different assumptions that lie behind the algorithms, there is a correlation between the performances of the algorithms per user. This can be observed in Figure 7: the better the performance of the best-performing algorithm, PD+TM, the better the performances of PD, PD+AM and LRU. From the same figure it also becomes clear that there is no correlation at all with the revisit rate: one would expect that users who revisit pages more often – who are shown to have more frequent transitions – are more predictable in their behavior; this turns out not to be the case. Note also that the – poor – performance of MFU does not follow the pattern of the other algorithms and is not correlated with the revisit rate either.

Figure 6: a) Reduction ratios of the average ranking position for LRU, MFU and PD; b) reduction ratios for PD with AM and TM (simple); c) Precision at 10 for LRU, MFU and PD; d) Precision at 10 for PD with AM and TM (simple).

Figure 7: Performance (P@10) of the different algorithms per user. Users are ordered by the best-performing algorithm.

5.3 Discussion

In our analysis we compared various algorithms and combinations of algorithms for predicting which pages users will revisit in a session. These algorithms exploit the following characteristics of revisits:

• Revisits are typically focused on pages visited very recently.
• The more revisits, the more repetitive behavior in terms of transitions between pages.
• There is a small group of pages that is visited very frequently.

It turns out that, even though a small set of frequently visited pages covers the majority of revisits, the recency effect, as well as frequent transitions, plays a larger role in the prediction algorithms. The evaluated algorithms, in particular Polynomial Decay in combination with the transition matrix, significantly improve upon the list of most recently used pages, in particular for users for whom the list of LRU pages performs relatively badly. The differences become smaller as the recency effect increases.

The counter-intuitive effect that a higher revisit rate does not lead to better predictions can be explained by the many differences in individual behavior between users (such as the number of news sites or bulletin boards that a user actively follows, strategies for search and backtracking, and the number of recurring activities). From this we can conclude that revisiting behavior is mainly influenced by the recency effect, but it definitely makes sense to take the popularity of pages and the currently/last visited page (the user's current context) into account as well.

6 Conclusions

In this paper we studied the browsing behavior of 25 users during a period of approximately 6 months. We analyzed the data to build a comprehensive stereotype of users' behavior, focusing on their revisitation patterns. We also ran experiments applying a variety of methods to predict users' revisitations.
Our proposed Polynomial Decay algorithm in combina- tion with users’s navigational patterns as they are encap- sulated by the Transition Matrix outperforms substantially existing methods commonly used for revisitation predic- tion. In our previous work [Papadakis et al., 2010] we dem- onstrated the better performance of our methods on a server-side dataset. Combining with the results of the work presented here (on a client-side dataset) we can firmly claim that our proposed method is more effective than the baselines LRU and MFU for both cases. Though the experiments presented here are on the field of Web Usage Mining, our real goal is to improve the support of revisitation by the means of intelligent user interfaces. Hence, this was the first step towards a more effective user modeling method. Our plan, as future work, is to implement a browser interface that allows users to interact with the output of our methods: a collection of related URLs that does not contain only the obvious selec- tions, but also related websites that are usually overlooked between the head and the long tail. References [Adar et al., 2008] Adar, E., Teevan, J., and Dumais. S. T: Large scale analysis of Web revisitation patterns. In CHI, pages 1197-1206, 2008. [Agrawal et al., 1993] Agrawal, R., Imielinski, T., and Swami, A. N:. Mining association rules between sets of items in large databases. In SIGMOD, pages 207–216, 1993. [Agrawal and Srikant, 1995] Agrawal, R., and Srikant, R.: Mining sequential patterns. In ICDE, pages 3–14, 1995. [Awad et al., 2008] Awad, M., Khan, L., and Thuraising- ham, B.:Predicting WWW surfing using multiple evi- dence combination. In The VLDB Journal 17, 3, pages 401-417, 2008. [Brin and Page, 1998] Brin, S., and Page, L.: The anatomy of a large-scale hypertextual web search engine. In Computer Networks 30(1-7), pages 107-117, 1998. [Cockburn and McKenzie, 2001] Cockburn, A., and McKenzie, B.: What do Web users do? An empirical analysis of Web use. In Int. J. of Human-Computer Studies, 54(6), pages 903-922, 2001. [Cormode et al., 2009] Cormode, G., Shkapenyuk, V., Srivastava, D., and Xu, B.: Forward Decay: A Practical Time Decay Model for Streaming Systems, In ICDE, pages 138-149, 2009. [Gery and Haddad, 2003] Gery, M., and Haddad, H.: Evaluation of web usage mining approaches for user’s next request prediction. In WIDM, pages 74–81, 2003. [Gaul and Schmidt-Thieme, 2000] Gaul, W., and Schmidt-Thieme, L.: Mining web navigation path fragments. In WEBKDD, 2000. [Herder, 2005] Herder, E.: Characterizations of user Web revisit behavior. In Proceedings of Workshop on Adap- tivity and User Modeling in Interactive Systems, 2005. [LeeTiernan, 2003] LeeTiernan, S., Farnham, S., and Cheng, L.: Two methods for auto-organizing personal web history. In CHI Extended Abstracts on Human Factors in Computing Systems, pages 814-815, 2003. [Obendorf et al., 2007] Obendorf, H., Weinreich, H., Herder, E., and Mayer, M.: Web page revisitation revi- sited: Implications of a long-term click-stream study of browser usage. In CHI, pages 597-606, 2007. [Papadakis et al., 2010] Papadakis, G., Niederee, C., Nejdl, W.: Decay-based Ranking for social application content. In WEBIST, 2010. [Pedersen et al., 2010] Pedersen, E. R., Gyllstrom, K., Gu, S., and Hong, P. J.:Automatic generation of research trails in web history. In IUI, pages 369-372, 2010. [Tauscher and Greenberg, 1997] Tauscher, L., and Greenberg, S.: How people revisit Web pages: Empiri- cal findings and implications for the design of history systems. 
InInternational Journal of Human-Computer Studies, v.47, n.1, pages 97-137, 1997. 313 User and Document Group Approach of Clustering in Tagging Systems Rong Pan, Guandong Xu and Peter Dolog IWIS — Intelligent Web and Information Systems Department of Computer Science Aalborg University {rpan, xu, dolog}@cs.aau.dk Abstract In this paper, we propose a spectral cluster- ing approach for users and documents group modeling in order to capture the common preference and relatedness of users and doc- uments, and to reduce the time complexity of similarity calculations. In experiments, we investigate the selection of the optimal amount of clusters. We also show a reduc- tion of the time consuming in calculating the similarity for the recommender systems by se- lecting a centroid first, and then compare the inside item on behalf of each group. keywords: User Profile, Document Profile, Spectral Clustering, Group Profile, Modular- ity Metric 1 Introduction The success of social tagging resulted in the prolifera- tion of sites like Delicious, CiteUlike, Digg, or Flickr. Such sites contain large amount of user tagged data for information retrieval in social-tagging systems [6; 7; 12; 16; 17], or for the establishment of user profiles and the discovery of topics, among other applications. [5] uses the tags associated with specified objects to build a single user profile. However, here comes a problem: it is hard to express the entire user profile or the document profile. The traditional user profile ex- presses the users’ preferences depending on collecting users’ behaviors information, such as provides many tedious options in their registrations. The disadvan- tage with such an approach is too much reliance on users who is not able very often to express his entire user profile and interests. The document profile shows the background, categories, and keywords, it also de- pends on the description when it is added into the system. However, with the increase of the number and types of users, it’s hard to express the different emphases for various users with the same document. In social collaborative (tagging) systems, the com- mon perception or judgment on documents are deter- mined by a group of users rather than a single user. In a similar way, by using a group of documents, rather than one document, it might represent much more spe- cific information during the information search. Therefore, our assumption is that by utilizing the community views of users and documents, we are able to facilitate the organization of information resources in search and navigation. In social tagging systems, users express their judg- ment by annotations or tagging. The tag can endorse their opinions on various web items, which is one of the defining characteristics of Web 2.0 services, allow- ing them to collectively classify and index information for later search and sharing. With social tagging, a user can express his own perspective on web items, e.g. resources like images, videos, scientific papers, thus allowing other like-minded users to find and use the similar information. The tagging has been already utilized for organiz- ing the resources. [3] develops a page rank algorithm of resources based on preference tag vectors. [6; 8; 7] investigate social and behavioral aspects of a tag- based recommender system which suggests similar web pages based on the similarity of users’ tags. 
However, there is another problem emerging: not all of the social tagging systems proposed so far maintain high quality and quantity of tag data. It is particularly prominent when a new user enters the system or a new document is added into the system. If the individual user profile or document profile can be collected and grouped into several groups character- ized by the significant tags, it is believed that common tags annotated by the most objects inside the group can reflect the characteristics of user preference or doc- ument functionality. Moreover, it will be of benefit for solving the problem of low tag quality of individual user or document. Even when a new user or a new document is added into the system, the tags can be extended to the user by referring to the majority tag- ging behavior of users on documents. Regarding to the previous problems, even if the tag is rich enough for the users and documents, the time consuming is still very high when a user wants to get the most appropriate document from a large document database, since the system has to calculate the simi- larity between users or documents one by one. We propose the method that calculates the similar- ity between the target tag vector and the centroids of all clusters to determine the cluster with highest sim- ilarity, then calculate the similarity of the target tag vector with the document profiles inside the cluster to rank the whole documents. In such way the time con- suming can be reduced. Since we have got the groups of user profile or document profile, how to choose the number of clusters is another problem. The traditional way is to assign the initial clustering number manu- ally. In this paper, we use the modularity metric [13] to evaluate the optimal number for the clusters. Based on the problems mentioned above, this paper 315 proposes an approach for group modeling by utilizing a clustering algorithm. The group modeling aims at as- signing the individual users or document profiles into different groups, which correspond to various user pref- erences or content relatedness from the large amount of data for tagging. User Group Profile and Document Group Profile can be generated from individual user profiles and docu- ment profiles; both of them are expressed by the tags. Group profiling is not constructed based on stereo- types but based on the results of clustering algorithms from transactional data. It can identify the objects in- side the community with similar tags, and collect the data for the similar objects. It can expand the tag set for the individual object inside the community which is helpful for the poor tag quality and quantity. Further- more, for making tag-based recommendation, it will significantly reduce the time consuming in calculating the similarity between the user and document groups. The main contributions in this paper are: 1. A group modeling method by utilizing the clus- tering algorithm. 2. The most appropriate number of clusters to gen- erate the User Group Profile and the Document Profile by using the modularity metric. 3. Reduction in time needed for computation for or- ganizing the documents comparing to the other methods. The rest of the paper is organized as follows: Section 2 presents the related work in the field of clustering and profiling. In section 3, we describe the preliminaries for the data model. 
Section 4 discusses the details of user profile and document profile with the introduced mathematical models and how to get the group pro- file by utilizing the spectral clustering algorithm. The experiment is designed in terms of datasets and evalu- ation measures in section 5, and experimental results and comparisons are presented in this section as well. We conclude the paper and discuss possible future re- search directions in section 6. 2 Related Work The folksonomy in [3] has been defined as a data struc- ture that evolves over time when people annotate re- sources with freely chosen words. It is user-contributed data aggregated by collaborative tagging systems. In such systems, users are allowed to choose terms freely to describe their favorite web resources. A folkson- omy is generally considered to consist of at least three sets of elements, namely users, tags and resources. Al- though there can be different kinds of resources. The prerequisite of personalization is to acquire user profile that describes user’s interests, prefer- ences and background knowledge about specified do- mains. Methods are used for modeling user pro- files include logic-based representation and inference, Bayesian models, feature-based filtering, Clique-based filtering, and neural networks. However, such user pro- file is still for individual persons. Our approach is to cluster the similar users in the same communities. [5] proposes to create user profiles from the data available in such folksonomy systems by letting user specify the most relevant objects in the system. Instead of using the objects directly to represent the user profiles, they use the tags associated with the specified objects to build the user profiles. [20] presents analysis on the personal data in folk- sonomies, and investigates how accuracy rate user pro- files can be generated from this data. They propose an algorithm to generate user profiles which can accu- rately represent the multiple interests. F. Durao and P. Dolog in [6; 8; 7] present a tag- based recommender system which suggests similar Web pages based on the similarity of their tags from a Web 2.0 tagging application. They also propose an approach to extend the basic similarity calculus with external factors such as tag popularity, tag represen- tativeness and the affinity between user and tag. K. R. Bayyapu and P. Dolog in [2] tries to solve the problems of sparse data and low quality of tags from related domains. They suggest using tag neighbors for tag expression expansion. However the tag neighbors are based on the content of documents. We propose another approach to extend the tag set by the group profiling. [19] uses a framework of User-Profile Modeling based on Transactional data for modeling user group profiles based on the transactional data which can incorporate external information, either by means of an internal knowledge base or on dynamic data supplied by a spe- cific information extraction system. Such user group profiles consist of three types: basic information at- tributes, synthetic attributes and probability distri- bution attributes. User profiles are constructed by clustering user transaction data and integrating clus- ter attributes with domain information extracted from application systems and other external data sources. And Teevan et al. apply group profiles to personalize search by an algorithm to “groupiz” (versus “person- alize”) in result ranking on group-relevant queries [18] , Abel et al. 
[1] shows that the quality of search result ranking in folksonomy systems can be significantly im- proved by introducing and exploiting the grouping of resources and Mei and Church show that group profiles facilitate Web search [11]. Clustering can divide the large amount of data into several groups. Clustering algorithms, specially de- signed for transactional data, can efficiently partition historic user transactions into clusters [9][15]. Each cluster is a set of transactions representing the inter- ests of a particular user group. It is the assignment of a set of observations into subsets so that observations in the same cluster are similar in some sense. We want to use the clustering for unsupervised learning in the group profiling. 3 Preliminaries for Folksonomy Data Model The user profile can be used to store the description of the user’s characteristics. Such information can be exploited in social tagging systems for taking the per- sons’ characteristics and preferences into account . For example, the social tagging systems usually ask the users to choose their own words as tags to describe the favorite web resource. So the user profiles can jus- tify the benefit and interest for various users. The document profile is represented by the meta- data generated by the community of users tagging the 316 documents. It is the process that refers to the con- struction of a profile for a specific via the extraction from a set of tagging data. When users want to annotate web resource for bet- ter organization and use the relevant information to their needs later, they will tag such information with free-text keywords. The tags, which are given by the users, reflect the navigational preference and interest of them. On the other hand, with the increase of doc- uments number that the user visited and annotated, each user has his own tag set which characterizes the interest or preference. Likewise, each tagged document also has its own tag set which expresses the content re- latedness and subject of the document. In the context of social tagging systems, the user profiles and docu- ment profiles thus are expected to be represented by the representative tags. Therefore the process of user and document modeling is to capture the significant tags from a large volume of tagging data in a social collaborative environment. There are a number of studies on user and document profiling (see for example [20; 19]). Amongst them, the basic idea of such approaches is originated from the in- troduction of a specific mathematical modeling of folk- sonomy. The folksonomy is a three-dimensional data model of social tagging behaviors of users. In social tagging systems, both the user profiles and document profiles are formulated starting from the folksonomy model. In the following section, in order to well reveal the mutual relationships between these three-fold en- tities, i.e. user, item and tag, we firstly briefly discuss the data model used in the following group profiling processes. A folksonomy F according to [10] is a tuple F = (U, T, D, A), where U is a set of users, T is a set of tags, D is a set of Web documents, and A ⊆ U ×T ×D is a set of annotations. The relationship is shown as Fig1. Figure 1: Relationship of users, tags, resources in folk- sonomy We can construct the folksonomy data model from the tagging data by such following steps: collecting the data of users, tags and resources from the ex- plicit information and implicit information. 
We then represent them in a three-dimensional vector space. Based on this we can define the documents' data as the X coordinate, the users' data as the Y coordinate, and the tags' data as the Z coordinate. The tagging relationship in the folksonomy is R_tagging \subseteq U \times T \times D, R_tagging \subseteq A, where U = {U_1, U_2, ..., U_m} is the set of users, T = {T_1, T_2, ..., T_k} is the set of tags, and D = {D_1, D_2, ..., D_n} is the set of documents. As shown in Fig. 1, each point in the three-dimensional vector space can be interpreted as: user u ∈ U has tagged document d ∈ D with tag t ∈ T.

Upon the folksonomy data model, we can derive the user and document profiles by utilizing the relationships among the users, tags and documents in the tagging procedures, which will be discussed in the following section.

4 User Group Profiling and Document Group Profiling by Clustering

As mentioned in the introduction, the poor quality and quantity of tag data can be a problem. Meanwhile, the time complexity is also a big concern when calculating the recommendation rank for objects based on large amounts of data. In the following parts, this paper focuses on solving these problems.

4.1 User Profile and Document Profile

In social tagging systems, we can obtain the user profile and the document profile by utilizing and analyzing the relationships among the users, tags and documents modeled in the folksonomy.

First of all, we discuss user profiling. For a given user, if we want to study his interests, only the tags associated with documents need to be considered. In the folksonomy data model, we identify each user by a unique id, for example user U_i ∈ U, i = 1, ..., M.

Figure 2: Matrix UT and Matrix DT in the folksonomy.

As shown in Fig. 2, a two-dimensional matrix UT_i is extracted from the relationship between documents and tags for a particular user. In UT_i, each column corresponds to a document D_n ∈ D, n = 1, ..., N, used by user U_i, and each row corresponds to a tag T_k ∈ T, k = 1, ..., K:

UT_i = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{k1} & u_{k2} & \cdots & u_{kn} \end{pmatrix}, \quad u_{kn} \in \{0, 1\}.

Here u_{kn} is set to 1 if there exists an association between tag T_k and document D_n annotated by user U_i, and 0 otherwise.

By accumulating over the rows of matrix UT_i, the frequency of a tag is defined as t_{ik} = \sum_{n=1}^{N} u_{kn}, which reveals the user's preference and interest. We thus obtain the full set of pairs of tags and their frequency weights, so the profile of user U_i in the form of a tag set can be defined as

UP_i = \{(T_1, t_{i1}), (T_2, t_{i2}), ..., (T_K, t_{iK})\}, \quad where \; t_{ik} = \sum_{n=1}^{N} u_{kn}, \; T_k \in T, \; k = 1, ..., K.
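Starting from the annotation set of the folksonomy, the tag-frequency profile of a user can be assembled as sketched below in Python. The triple format (user, tag, document) is an assumption of the sketch that matches the folksonomy model above; a document profile is obtained symmetrically by grouping over documents instead of users.

```python
from collections import defaultdict

def user_profiles(annotations):
    """Build UP_i = {(T_k, t_ik)} where t_ik counts the distinct documents
    that user U_i annotated with tag T_k (i.e., the row sums of UT_i)."""
    tagged = defaultdict(set)                 # (user, tag) -> set of documents
    for user, tag, doc in annotations:
        tagged[(user, tag)].add(doc)
    profiles = defaultdict(dict)
    for (user, tag), docs in tagged.items():
        profiles[user][tag] = len(docs)       # t_ik = sum_n u_kn
    return profiles

# A = [("u1", "classic", "d1"), ("u1", "classic", "d2"), ("u1", "guns", "d2")]
# user_profiles(A)["u1"] == {"classic": 2, "guns": 1}
```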
So the profile of document Dn in the form of tag set can be defined as DPi = {(T1, ti1), (T2, ti2) · · · (Tm, tim)},m = 1, · · · ,M , where tik = M∑ m=1 ukm, Tk ∈ T, k = 1, · · · ,K. From the steps mentioned above, the user profiles and document profiles are defined as a single user or document respectively rather than a group of users or documents. However, in social tagging systems, the group profiles of users or documents are more likely to reflect the common preference or relatedness of like- minded users or documents with similar functionality. In the following section, we will discuss the group pro- filing approach by using clustering. 4.2 Similarity Matrixes for the Users and Documents The relationship among all of the users is to calculate the similarity. The similarity is quantity that reflects the strength of relationship between two objects. In the last part, each user profile can be represented by the pair of tags and frequencies. We utilized the cosine distance between users. Its value ranges from 0 to 1, the higher value of the similarity, the more similar the objects are. The similarity matrix SM(Ui, Uj)is given by, SM(Ui, Uj) = UPi·UP |UPi|×|UPj | The users’ relationship can be represented in the form of bipartite graph model. Given a graph G = (U,E), where U is a set of users as Ui = {U1, U2, · · · , Um}, and E is a set of edges which en- try SM (Ui, Uj) reflects the similarity between users, the similarity matrix SM(ui, uj). Similarly with the document profiles, we can define a graph with N users G = (D,E), where D is a set of documents as Di = {D1, D2, · · · , Dn}, and E is a set of edges which entry SM (Di, Dj) reflects the similarity between documents. The similarity matrix SM(Di, Dj)is given by: SM (Di, Dj) = DPi·DPj |DPi|×|DPj | . 4.3 Group Profiling via Clustering Algorithm To accomplish the group profiling, one of the ap- proaches is to group the user profile and document profile into several groups based on the similarity so that the objects in the same groups can share tag set. The clusters of users or documents reveal the common user preference or relatedness of documents. It can benefit the user in the same group to share the similar interests or documents. Clustering algorithm aims to assign a set of obser- vations into subsets so that observations in the same cluster are similar in some sense. It is specially de- signed for transactional data and can efficiently par- tition historic user transactions into clusters [8; 14; 12]. Each cluster is a set of transactions representing the interests of a particular user group. So it can find the potential groups from the user profile and docu- ment profile. The object of clustering is: similar objects have high similarity, the similarity is low between objects of dif- ferent clusters. Clustering is the process to adjust the ranks of the similarity matrix, by a number of ma- trix blocks to meet the similarity value larger among the inside elements, while the similarity value is small between the clusters. There are a lot of clustering algorithms such as k- means, fuzzy c-means, single linkage and so on. In clustering analysis, almost all approaches are based on the similarity between subjects to partition the data points. Various clustering approaches have different advantages and drawbacks. Among the traditional clustering algorithms, spectral clustering has the supe- rior capability of effectively group data by leveraging the statistical property of similarity matrix of data. 
4.3 Group Profiling via Clustering Algorithm

To accomplish group profiling, one approach is to group the user profiles and document profiles into several clusters based on their similarity, so that the objects in the same group can share a tag set. The clusters of users or documents reveal the common user preferences or the relatedness of documents, and users in the same group benefit from sharing similar interests or documents.

A clustering algorithm aims to assign a set of observations into subsets so that observations in the same cluster are similar in some sense. Clustering algorithms specially designed for transactional data can efficiently partition historic user transactions into clusters [8; 14; 12]. Each cluster is a set of transactions representing the interests of a particular user group, so clustering can find the potential groups in the user profiles and document profiles.

The objective of clustering is that similar objects have high similarity, while the similarity between objects of different clusters is low. Clustering can thus be seen as reordering the rows and columns of the similarity matrix into a number of matrix blocks, such that the similarity values inside each block are large while the similarity values between blocks are small.

There are many clustering algorithms, such as k-means, fuzzy c-means, single linkage and so on. In cluster analysis, almost all approaches partition the data points based on the similarity between subjects, and the various clustering approaches have different advantages and drawbacks. Among the traditional clustering algorithms, spectral clustering has the superior capability of effectively grouping data by leveraging the statistical properties of the similarity matrix.

In this paper, we use the Spectral Clustering Algorithm. Spectral clustering refers to a class of techniques which rely on the eigenvalues of the adjacency (similarity) matrix; it partitions all points into disjoint clusters, such that points with high similarity are placed in the same cluster and points in one cluster have low similarity with points in other clusters. Spectral clustering is based on graph partitioning; we have explained how to obtain the similarity matrix from the graph in Section 4.2. The method maps the original inherent relationships onto a new spectral space, onto which the user or document profiles are projected. After the projection, the whole set of user or document profiles is simultaneously partitioned into disjoint clusters with minimum-cut optimization.

Compared to those algorithms, spectral clustering has many fundamental advantages. Results obtained by spectral clustering often outperform the traditional approaches; it is easy to implement and can be solved efficiently by standard linear algebra methods. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. The original formula for spectral clustering is

L = I - D^{-1/2} S D^{-1/2}.

According to the spectral graph theory in [4], the k singular vectors of the reformed matrix RM_User = D_u^{-1/2} SM_User D_u^{-1/2} provide the best approximation to the projection of the user-tag vectors onto the new spectral space, and RM_Document = D_d^{-1/2} SM_Document D_d^{-1/2} provides the corresponding projection of the document-tag vectors. Here D_u and D_d are the diagonal matrices of the user similarity matrix and the document similarity matrix, defined as

D_u(i, i) = \sum_{j=1}^{M} SM(U_i, U_j), \; i = 1, ..., M, \quad and \quad D_d(i, i) = \sum_{j=1}^{N} SM(D_i, D_j), \; i = 1, ..., N.

Let us take the document set as an example. In this case we assume that the first k singular vectors represent the best approximation of the original profile space. Let L_s be the m × k matrix of the k singular vectors of RM_Document. As our aim is to conduct a clustering on the document profile attributes, we create a new m × k matrix RV that reflects the projection of the row and column vectors onto the new spectral space in lower dimension:

RV = D_d^{-1/2} L_s.

The clustering on RV results in the group profiling. The full steps of group profiling via the clustering algorithm are summarized in the algorithm below.

Input: the N document profiles DP = {DP_i | i = 1, 2, ..., N}, with DP_i = {(T_1, t_1), (T_2, t_2), ..., (T_K, t_K)}.
Output: a set of k clusters DGP = {DGP_i | i = 1, 2, ..., k} such that the cut of the k-partitioning of the bipartite graph is minimized.

1. Construct the usage similarity matrix SM_Document from the document profiles, whose elements are determined by the distribution of tags over all users.
2. Calculate the diagonal matrix D_d.
3. Form the new matrix RM_Document = D_d^{-1/2} SM_Document D_d^{-1/2}.
4. Perform an SVD (singular value decomposition) on RM_Document, obtain the k leading singular vectors L_s, and create the new projection matrix RV.
5. Execute a clustering algorithm on RV and return the clusters of documents DGP = {DGP_i | i = 1, 2, ..., k}.

From the above steps, the N documents are divided into k clusters, and the document group profile for each cluster is

DGP_i = \{DP_{i1}, DP_{i2}, ..., DP_{it}\} = \{(T_1, w_{i1}), (T_2, w_{i2}), ..., (T_t, w_{it})\},

where T_s ∈ T, s = 1, ..., t and (w_{i1}, w_{i2}, ..., w_{it}) is the centroid of the document cluster DGP_i.
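The listed steps can be sketched with standard linear algebra routines. The sketch below assumes the similarity matrix from Section 4.2 as a NumPy array and uses scikit-learn's k-means for the final partitioning step; it mirrors the steps (diagonal matrix, reformed matrix, SVD, projection, clustering), not the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_group_profiles(sm, k):
    """Partition objects into k clusters from their similarity matrix sm."""
    d = sm.sum(axis=1)                          # step 2: diagonal matrix D
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    rm = d_inv_sqrt @ sm @ d_inv_sqrt           # step 3: RM = D^-1/2 SM D^-1/2
    _, _, vt = np.linalg.svd(rm)                # step 4: SVD of RM
    ls = vt[:k].T                               # k leading singular vectors
    rv = d_inv_sqrt @ ls                        # projection RV = D^-1/2 Ls
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rv)
    return labels                               # step 5: cluster assignment
```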
Meanwhile, the selection of the cluster number k is another concern commonly encountered in the context of clustering. The value of k has a direct impact on the clustering performance: a larger k results in over-separation of users and documents, while a smaller k prevents the data from being sufficiently partitioned. It is therefore necessary to select an appropriate value of k before performing the clustering in order to achieve a better clustering performance. In the experimental part, we investigate the selection of k.

The user group profiles can be generated in the same way:

UGP_i = \{UP_{i1}, UP_{i2}, ..., UP_{ij}\} = \{(T_1, x_{i1}), (T_2, x_{i2}), ..., (T_K, x_{iK})\},

where T_s ∈ T, s = 1, ..., t and (x_{i1}, x_{i2}, ..., x_{iK}) is the centroid of the user cluster UGP_i.

5 Experimental evaluations

In order to evaluate the proposed group profiling, we performed experiments on the "MovieLens" dataset. Our experiments focus on the cluster number selection, a demonstration of the group profiles, and the computational cost reduction.

5.1 Dataset and Modularity Metric

As experimental dataset we utilize the part of the "MovieLens" data which contains tags provided by users on movies. It includes 521 users, 1399 documents and 1956 tags. We show how the results differ when choosing different numbers of clusters in order to determine the optimal number of clusters. The reported data is based on the average result of executing the same experiment ten times over the same dataset.

The modularity metric is one of the standard quantitative measures for evaluating the "goodness" of clusters. The modularity of a particular division of a network is calculated based on the differences between the actual number of edges within a community in the division and the expected number of such edges if they were placed randomly. Good divisions, which have high values of the modularity, are those with dense connections between the nodes within modules but sparse connections between different modules. It helps to evaluate the quality of a clustering, i.e., the similarity within each cluster.

After clustering, we obtain several clusters. Consider a particular division of a network into m communities. We can define an m × m symmetric matrix SM whose element sm_{C_pq} is the fraction of all edges in the network that link vertices in community C_p to vertices in community C_q. The similarity sm_{C_pq} between the two clusters C_p and C_q is defined as [13]

sm_{C_pq} = \frac{\sum_{c_p \in C_p} \sum_{c_q \in C_q} c_{pq}}{\sum_{c_p \in C} \sum_{c_q \in C} c_{pq}}, \quad p, q = 1, 2, ..., m,

where c_{pq} denotes the corresponding element (the similarity between two objects) in the underlying similarity matrix. When p = q, sm_{C_pq} is the similarity between the elements inside a cluster, while for p ≠ q, sm_{C_pq} is the similarity between the cluster C_p and the cluster C_q. The condition for a high-quality clustering is therefore to maximize \sum_{p} sm_{C_pp} and to minimize \sum_{p \neq q} sm_{C_pq}, p, q = 1, 2, ..., m.

Summing over all pairs of vertices in the same group, the modularity, denoted Q, is given by

Q = \sum_{p=1}^{m} \left[ sm_{C_pp} - \left( \sum_{q=1}^{m} sm_{C_pq} \right)^2 \right] = \mathrm{Tr}\, SM - \| SM^2 \|,

where m is the number of clusters. The trace of this matrix, \mathrm{Tr}\, SM = \sum_{p=1}^{m} sm_{C_pp}, gives the fraction of edges in the network that connect vertices in the same community, and clearly a good division into communities should have a high value of this trace. However, if we place all vertices in a single community, the trace alone would attain its maximal value of 1 while giving no information about community structure at all. The quantity Q therefore measures the fraction of the edges in the network that connect vertices of the same type minus the expected value of the same quantity in a network with the same community divisions. Utilizing the value Q to evaluate the clusters [13] is a common method: values approaching Q = 1, which is the maximum value, indicate networks with strong community structure. In practice, the values for such networks typically range from 0 to 1. The higher the value of Q, the better the quality of the clustering for a predefined cluster number k. Examining the Q value thus allows us to determine the optimal number of clusters.
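The modularity Q of a given partition can be computed directly from the pairwise similarity matrix and the cluster labels, following the formulas above. This is a sketch under the assumption that every pairwise similarity value is treated as an edge weight of the network; it is not the authors' evaluation code.

```python
import numpy as np

def modularity(sm, labels):
    """Q = sum_p [ sm_pp - (sum_q sm_pq)^2 ] for an m-community division."""
    communities = sorted(set(labels))
    idx = {c: [i for i, l in enumerate(labels) if l == c] for c in communities}
    m = len(communities)
    e = np.zeros((m, m))                  # fraction of edge weight between p and q
    total = sm.sum()
    for p, cp in enumerate(communities):
        for q, cq in enumerate(communities):
            e[p, q] = sm[np.ix_(idx[cp], idx[cq])].sum() / total
    return np.trace(e) - np.sum(e @ e)    # Tr SM - ||SM^2||
```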
5.2 Experimental Results

Optimal Cluster Number Selection

Here we compare the resulting Q values when using Spectral Clustering, Single Linkage Clustering and Random Clustering.

From the 521 constructed user profiles, we employ the various clustering algorithms to build the group user profiles. We generate clusterings with 2 to 260 clusters and utilize the modularity method to evaluate the results. We stop at 260 clusters because this is half the total number of users: when the number of clusters is higher than 260, the average number of members in each cluster is lower than 2, which does not provide reasonable clustering information. The results are shown in Fig. 3: the value of Q for the Spectral Clustering Algorithm is consistently higher than for the other two algorithms. When the number of clusters is 25, Q reaches its maximum of 0.381. With the growth of the number of clusters, the value of Q gradually decreases to 0.19. We conclude that, for this dataset, 25 clusters is the best choice.

Figure 3: Comparison of the three algorithms on 521 users.

From the 1399 constructed document profiles, we generate clusterings with 5 to 700 clusters and utilize the modularity method to evaluate the results. As for the user profiles, we stop at 700. As shown in Fig. 4, when the number of clusters is 20, Q reaches its maximum of 0.277. With the growth of the number of clusters, the value of Q decreases to 0.026. We find that, for this dataset, 20 clusters is the best choice.

Figure 4: Comparison of the three algorithms on 1399 documents.

Demonstration of the Group Profiles

Having established that 20 clusters is the optimal number for the 1399 documents, we take 2 of these clusters for analysis. One cluster contains 51 documents with 239 tags. The main tags in it are "classic", "based on a book", "black and white", "National Film Registry", "breakthroughs", "Disney" and so on; the movies in this cluster seem to be related to movies about life. Another cluster has 79 documents with 397 tags; the dominant tags are "action", "organized crime", "guns", "hysterical", "USA film registry", "afternoon section", "Oscar (Best Actor)", "Oscar (Best Cinematography)", "Oscar (Best Director)", "Oscar (Best Picture)" and so on. These movies tend to be Oscar movies with breathtaking content.

Comparison of Time Consumption

When the user or document profiles are used in tagging systems for further applications, similarity calculation is a major operation involved. An advantage of group profiling is the possibility of reducing the computational complexity. For example, cosine similarity is often used to determine the ranking of candidates. The traditional way is to calculate the similarity between a target's tags and each document's tags; in this way the time complexity is O(N), which costs much time when the system has a large dataset.

After clustering all of the documents, each cluster has its own centroid as the representation of the group profile, i.e., the "center point" of the cluster. The centroid can be generated as the average frequency of the tags inside the cluster. The similarity between the centroid and the items inside the same cluster should be the highest, and the similarity between the centroid and the items inside other clusters should be the lowest; the centroid is thus representative of its cluster. If the N documents have been clustered into m communities, the similarity calculation is divided into two steps: first, calculate the similarity between the target tag vector and the centroids of all clusters to determine the cluster with the highest similarity score; second, calculate the similarity of the target tag vector with the document profiles inside that cluster to rank the documents. Since the number of centroids, m, equals the number of communities and is much smaller than the number of documents, N, the time needed for calculating the similarities is dramatically reduced from O(N) to O(m + N/m).
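The two-step centroid-based lookup can be sketched as follows, reusing the `cosine` function from the Section 4.2 sketch. The cluster and profile containers are assumptions of the sketch (dictionaries keyed by cluster id, holding (document id, tag-frequency profile) pairs), not a structure prescribed by the paper.

```python
def build_centroids(clusters):
    """Precompute, per cluster, the average tag frequency of its profiles."""
    centroids = {}
    for cid, docs in clusters.items():
        profiles = [prof for _, prof in docs]
        tags = {t for p in profiles for t in p}
        centroids[cid] = {t: sum(p.get(t, 0.0) for p in profiles) / len(profiles)
                          for t in tags}
    return centroids

def two_step_rank(target, clusters, centroids):
    """Step 1: match the target vector against the m centroids.
    Step 2: rank only the documents of the best-matching cluster."""
    best = max(centroids, key=lambda cid: cosine(target, centroids[cid]))
    ranked = sorted(clusters[best],
                    key=lambda item: cosine(target, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked]   # roughly O(m + N/m) comparisons
```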
In our experiment, computing the similarity against all 1399 documents individually takes 152.83 seconds, whereas our proposed approach needs just 0.045 seconds to obtain the final ranking of the documents.

6 Conclusion and future work

In this paper, we discuss an algorithm for building user group profiles and document group profiles in social tagging systems. By utilizing a clustering algorithm, group profiling can be performed on the individual user and document profiles. We carried out experiments on a real tagging dataset to validate the proposed approach, investigated the modularity method to determine the optimal number of clusters, and demonstrated the content of the clusters. Finally, we compared the time consumption of the similarity calculation involved in real applications. It is shown that group profiling can deal effectively with the tasks outlined in the paper.

For future work, we intend to conduct research on the optimization of the algorithm, and to explore the deployment of group profiling in tag-based recommender systems. We will also investigate clustering for tags, which we believe should help in tag recommendation and in representing user interests.

References

[1] F. Abel, N. Henze, D. Krause, and M. Kriesell. On the effect of group structures on ranking strategies in folksonomies. Weaving Services and People on the World Wide Web, pages 275–300, 2009.
[2] K. R. Bayyapu and P. Dolog. Tag and Neighbour Based Recommender System for Medical Events. In Proceedings of MEDEX 2010: The First International Workshop on Web Science and Information Exchange in the Medical Web colocated with WWW 2010 conference, 2010.
[3] J. Davies, D. Fensel, C. Bussler, and R. Studer. The Semantic Web: Research and Applications. In Proceedings of the First European Semantic Web Symposium, 2004.
[4] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–274. ACM, 2001.
[5] J. Diederich and T. Iofciu. Finding communities of practice from user profiles based on folksonomies. In Proceedings of the 1st International Workshop on Building Technology Enhanced Learning solutions for Communities of Practice. Citeseer, 2006.
[6] F. Durao and P. Dolog. A personalized tag-based recommendation in social web systems.
Adapta- tion and Personalization for Web 2.0, page 40, 2009. [7] F. Durao and P. Dolog. Social and Behavioral Aspects of a Tag-Based Recommender System. In Ninth International Conference on Intelligent Systems Design and Applications, 2009. ISDA’09, pages 294–299, 2009. [8] F. Durao and P. Dolog. Extending a hybrid tag-based recommender system with personaliza- tion. In Proceedings of the 2010 ACM Symposium on Applied Computing, pages 1723–1727. ACM, 2010. [9] S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical at- tributes* 1. Information Systems, 25(5):345–366, 2000. [10] A. Hotho, R. Ja¨schke, C. Schmitz, and G. Stumme. Folkrank : A ranking algorithm for folksonomies. In K.-D. Althoff and M. Schaaf, editors, LWA, volume 1/2006 of Hildesheimer Informatik-Berichte, pages 111–114. University of Hildesheim, Institute of Computer Science, 2006. [11] Q. Mei and K. Church. Entropy of search logs: how hard is search? with personalization? with backoff? In Proceedings of the international conference on Web search and web data mining, pages 45–54. ACM, 2008. [12] A. Nanopoulos, H. H. Gabriel, and M. Spiliopoulou. Spectral Clustering in Social- Tagging Systems. Web Information Systems Engineering-WISE 2009, pages 87–100, 2009. [13] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):26113, 2004. [14] M. G. Noll and C. Meinel. Web search person- alization via social bookmarking and tagging. In Proceedings of the 6th international semantic web conference and 2nd Asian conference on Asian semantic web, pages 367–380. Springer-Verlag, 2007. [15] S. Rendle, L. B. Marinho, A. Nanopoulos, and L. Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of the 15th ACM SIGKDD inter- national conference on Knowledge discovery and data mining, pages 727–736. ACM, 2009. [16] J. Stoyanovich, S. Amer-Yahia, C. Marlow, and C. Yu. Leveraging tagging to model user interests in del. icio. us. AAAI SIP, 2008. [17] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering, 2009. [18] J. Teevan, M. R. Morris, and S. Bush. Discov- ering and using groups to improve personalized search. In Proceedings of the Second ACM In- ternational Conference on Web Search and Data Mining, pages 15–24. ACM, 2009. [19] Y. Yang and N. Marques. User group profile mod- eling based on user transactional data for person- alized systems. Progress in Artificial Intelligence, pages 337–347, 2005. [20] C. M. A. Yeung, N. Gibbins, and N. Shadbolt. A study of user profile generation from folk- sonomies. In Social Web and Knowledge Manage- ment, Social Web 2008 Workshop at WWW2008. Citeseer, 2008. 321 Modeling, obtaining and storing data from social media tools with Artefact-Actor-Networks Wolfgang Reinhardt, Tobias Varlemann, Matthias Moi and Adrian Wilke University of Paderborn 33102 Paderborn, Germany {wolle,tobiashv,moisun,wilke}@uni-paderborn.de Abstract Social interaction between people has peerlessly changed with the availability of the Internet and the World Wide Web. The Internet brought new ways of communication technologies to live and enhanced people’s reachability, augmented pos- sibilities for personal presence and the sharing of information objects. 
People are engaging in so- cial networks in a steadily growing manner and share information objects within their commu- nities. The high initial amount of data in such networks can serve as foundation of serious in- vestigations towards social interactions of com- munities of learners. In this paper we introduce the technological foundation and architecture to model, obtain and store such user and object in- formation in so-called Artefact-Actor-Networks. Artefact-Actor-Networks combine classical so- cial networks with artefact networks that are con- structed by the use of the information objects and their connections. 1 Introduction The Internet has evolved to be the most frequently used medium of the 21st century and it is steadily growing. The so-called Web 2.0 movement [O’Reilly, 2005] en- gaged people in the active production of content on the Web and fostered technology-mediated human interactions in social networks. People are heavily taking part in so- cial networks for individual learning and knowledge work as well as for leisure activities. The huge amount of data in such networks can serve as foundation of serious investiga- tions towards social interactions in communities of learn- ers, knowledge workers or people spending their leisure activities online. There already is an existing body of knowledge about the properties of social networks such as Facebook (see [Ellison et al., 2007]), Twitter (see [Java et al., 2007; Ebner and Schiefner, 2008; Reinhardt et al., 2009a]), in Social Bookmarking systems and learning object reposi- tories [Vuorikari, 2009] but most often this work stay fo- cused on the relational and structural part of social network analysis [Granovetter, 1983]. Nevertheless, those studies are often combined with qualitative data from interviews or online questionnaires and thus can serve as useful in- formation source for future research. Despite their un- doubted usefulness, they do often not consider the arte- facts resulting from online interactions, their content, struc- ture and relations. Furthermore, the relations between arte- facts and their creators, editors or linkers are not consid- ered for making claims about a community. In [Reinhardt et al., 2009b] we introduced the model of Artefact-Actor- Networks (AANs)1 which tries to overcome those limita- tions by considering both content of artefacts and the rela- tions to actors interacting with them. As artefact we con- sider any digital information object that is shared within a communitiy (examples are simple HTML pages, status updates in microblogging services, entries in blogs, social bookmarks, online available documents, photos in picture sharing services and many more). In this paper we focus on the technical side of AANs and introduce the architecture to model, obtain and store such user and object information. 2 Obtaining and modeling data from social networks In this section we describe the concept of Artefact-Actor- Networks and the ontologies in place to model both actors in social networks and the artefacts resulting from their in- teraction. Moreover, we will reference existing standards and vocabularies for modeling objects and their interac- tions in online communities as well as metadata describing the respective properties of such objects. 2.1 Artefact-Actor-Networks Artefact-Actor-Networks (AANs) are an approach to se- mantically intertwine social networks with so-called arte- fact networks. 
The theoretical model and a first implemen- tation were first introduced by [Reinhardt et al., 2009b]. To connect artefacts and actors with each other, seman- tic relations are required. Relations in the network are connecting objects by a semantic context like isAuthor or isRightHolder. With the help of Artefact-Actor-Net- works participation in the life cycle of artefacts as well as significant connections to involved actors will be out- lined. Artefact-Actor-Networks consolidate multilayered social networks and artefact networks in an integrated net- work. Therefore, we consider the communication and col- laboration with each communication tool or artefact supply (e.g. Twitter, chats, e-mail or scientific documents) as a sin- gle layer of the respective network. We unite these single layers in both social and artefact networks to consolidated networks that contain all actors and artefacts respectively (cf. figure 1). While in the consolidated social network we can only make statements concerning the relations be- tween actors and in the consolidated artefact network we 1See http://artefact-actor-networks.net 323 can only analyze the relations between artefacts, Artefact- Actor-Networks (cf. figure 2) also contain semantic rela- tions between actors and artefacts. artefact network (e-mail) artefact network (documents) artefact network (chat) consolidated artefact network Figure 1: Consolidated artefact network resulting from three layers consolidated social network consolidated artefact network Figure 2: Artefact-Actor-Network with semantic relations between artefacts and actors In Artefact-Actor-Networks we discern three types of semantic relations: those between artefacts (ART2 rela- tions), those between actors (ACT2 relations) and finally relations that exist between actors and artefacts (AA rela- tions). Each kind of relation can be used for certain types of analyses and supports a different type of awareness in cooperative settings. ACT2 relations describe the nature of relationships between involved people. They characterize simple connections, friendships or kinships. Furthermore, they can show the kind of media people are communicating with. The Friend of a Friend (FOAF) project [FOAF, 2010] developed a RDF vocabulary to express interests, connec- tions and activities of people. ART2 relations on the other hand provide information on how artefacts are connected. The Dublin Core metadata standard and the SIOC project currently provide an expedient starting point [DCMI, 2010; SIOC, 2010]. Lastly, AA relations describe the seman- tics of relations between actors and the artefacts they inter- acted with. Dublin Core and SIOC provide useful relations to build upon, but the learning objects metadata standard (LOM) could be taken into account as well. 2.2 Relevant ontologies During the modeling of the application domain how- ever, we found out that we needed to extend the before- mentioned ontologies2 and vocabularies in order to cover the specifics of interaction with social media in learning networks. Thus, we created several ontologies for the so- cial media services we are analyzing (see Section 3.3) and made our ontologies publicly available3. The relations build on already existing standards for the modeling and storage of metadata and are further extended by our ap- plication. Figure 3 shows a simplified overview of the ontologies used in Artefact-Actor-Networks, where AAN- Base defines the basic entities Actor, Artefact and Key- word. 
AANMeta is an ontology that allows the aggregation of multiple actors (online handles) into one real person, and the relationship of so-called groups to real people, their actors and the artefacts related to a group (for example artefacts that are tagged with one of the group's tags). The AANOnline ontology is used to differentiate between artefacts resulting from online actions and artefacts that relate to activities taking place offline. The more specific tools and the respective ontologies are located towards the right of Figure 3.

3 The architecture of Artefact-Actor-Networks
The architecture of Artefact-Actor-Networks is composed of a backend to which several frontends can be connected. The backend is responsible for processing and storing data and for exposing this data to the frontends through well-defined interfaces. The architecture is based on the OSGi Service Platform, which is a component framework based on the Java platform. Due to the use of a specified component structure the backend is easily expandable, and components can be deployed at runtime. The communication between the components is ensured with techniques from OSGi. It is the backend's task to examine contents, and to analyze, store and provide the annotated data. These four tasks are reflected in the corresponding blocks of the AAN architecture (cf. Figure 4). The crawling block loads contents and generates structured data (according to the ontology). The datastore block stores all generated data and serves as the connection between all the existing blocks. The analyzer block contains several components with which a further refinement of the generated data is achieved. The fourth block provides the interfaces for the various frontends. These blocks and the contained components are described in the following.
2Ontologies are a common way to model problem domains in an extensible and open format that is usable in various contexts. [Lohmann and Riechert, 2010] note the very precise and popular definition of the term ontology given by [Gruber, 1993], who states that an ontology is "a specification of conceptualization". Furthermore, [Maalej et al., 2008] point to the fact that ontologies normally remain valid for a much longer time than, for example, conceptual models, as they describe a broader application domain and more general knowledge facts.
3See http://artefact-actor-networks.net/ontologies/2010/03/

Figure 3: Simplified overview of the AAN Ontologies

3.1 Crawling-Block
At the crawling block, requests for content analysis are processed. These requests are passed to the crawling block via OSGi services and analyzed by a processing chain. There are two interfaces: the components Crawler and CrawlerManager. The CrawlerManager contains high-level functions by which the Crawler is controlled. The processing chain is controlled by the crawling component. It consists of the components Accessor, which reads the content, MimeTyper, which identifies the type, and Parser, which analyzes the content.

Crawler
The Crawler component provides low-level functions to process tasks. Tasks are committed as URIs of the content. For such a URI, the Crawler executes the processing chain by loading the content, determining its MIME type and analyzing it. The selection of the Accessor component depends on the protocol used and the URI. Subsequently, the MIME type is determined, which in turn selects the Parser that stores the contents in the datastore block.
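To summarize the chain just described, the following simplified Java sketch wires the three stages together. Only the component names (Accessor, MimeTyper, Parser) and the methods isParsable and parse are taken from the paper; all other signatures are assumptions made for illustration.

    import java.io.File;
    import java.net.URI;
    import java.util.List;

    // Simplified sketch of the crawling chain; signatures are assumptions.
    interface Accessor {
        boolean supports(URI uri);          // e.g. selected by protocol: http, ftp, ...
        File access(URI uri);               // fetch the resource into a local file
    }

    interface MimeTyper {
        List<String> detect(File resource); // one or more MIME types
    }

    interface Parser {
        boolean isParsable(String mimeType);  // specialized parsers are preferred
        void parse(URI uri, File resource);   // extract structured data for the datastore
    }

    class Crawler {
        private final List<Accessor> accessors;
        private final MimeTyper mimeTyper;
        private final List<Parser> parsers;   // ordered: specialized before general

        Crawler(List<Accessor> accessors, MimeTyper mimeTyper, List<Parser> parsers) {
            this.accessors = accessors;
            this.mimeTyper = mimeTyper;
            this.parsers = parsers;
        }

        void crawl(URI uri) {
            // 1. pick an Accessor by protocol/URI and load the content
            Accessor accessor = accessors.stream()
                    .filter(a -> a.supports(uri))
                    .findFirst()
                    .orElseThrow(() -> new IllegalArgumentException("no accessor for " + uri));
            File local = accessor.access(uri);
            // 2. determine the MIME type(s)
            List<String> mimeTypes = mimeTyper.detect(local);
            // 3. choose the first (most specialized) Parser that can handle a type
            for (String mime : mimeTypes) {
                for (Parser p : parsers) {
                    if (p.isParsable(mime)) { p.parse(uri, local); return; }
                }
            }
        }
    }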
While the Parser is chosen, it is taken into consideration, if a specialized Parser exists that is able to handle variations of the detected MIME type. An example is a request of a web service, which supplies a MIME type text/xml. The use of a Parser that is able to handle this special type is more practicable than the use of a Parser, which can process all XML formats. The processing of tasks occurs asynchronously by the use of a thread pool with which several tasks can be exe- cuted parallel. CrawlerManager A superordinate of the Crawler component is the Crawler- Manager, which is using the services of the Crawler. The CrawlerManager provides functions of higher levels than the Crawler. Whereas the Crawler is receiving tasks via an exact URI, the CrawlerManager is designed to handle more complex jobs. It is possible to deal with individualities of web pages or investigate the structure of a page by follow- ing hyperlinks in HTML documents. To handle special tasks, there can be various imple- mentations of the CrawlerManager. Two examples, the GenericCrawlerManager and the MediaWikiCrawlerMan- ager, can be seen in the architecture (figure 4). The Gener- icCrawlerManager is able to process timed tasks, which can generate follow-up tasks. The MediaWikiCrawlerManager is specialized to the structure of MediaWiki pages and is able to crawl contents of an entire wiki. Accessor The first element of the processing chain is the Accessor component. It is responsible for reading content. The com- ponent is responding to the Crawler, which hands over an URI of the content to load. The Accessor component ac- cesses the resource, stores it in a local file, and returns a reference to the file. Different types of the Accessor component are conceiv- able. Those could enable the access to protocols like HTTP, FTP, SMTP or SVN. MimeTyper For further processing of a resource by a Parser the MIME type of resource is determined. This improves the choice of an appropriate Parser in the last stage of the processing chain. The MimeTyper is loading the temporary, local file of the content to analyze. Then it determines one or more MIME types of the content and returns it to the Crawler. 325 Figure 4: The AAN architecture 326 Parser The Parser is the last stage of the processing chain. It ex- tracts relevant information from the resources loaded by the Accessor, to build up an Artefact-Actor-Network. Different variations of Parsers are distinguished by the Parser type and the MIME types which can be processed. Types are setting in advance the order, in which the Parsers are cho- sen. Parsers which can handle a specialized MIME type are preferred to general Parsers. This guarantees the choice of a Parser which can process contents optimally. 3.2 DataStore block The DataStore block is responsible for the storage of all data produced in the application. For this purpose, four components are available to accommodate the differently structured data. Other components will be notified of any changes via an event service. All components in the Data- Store block send custom events for communication pur- poses. DataStore The Parsers generate RDF data from the analyzed contents that is stored in the DataStore according to the respective ontology. RDF data provides several advantages for Arte- fact-Actor-Networks. In addition to the flexibility of the data structure and the ability to generate descriptions of the data at runtime, there is a wide range of storage engines that are adapted to the processing of RDF data. 
The backend uses the Jena Framework4 to store and query the data. With Jena data can be queried both in source code and via the RDF data query language SPARQL5. The DataStore stores the data in a file format defined by the Jena Framework but one can also use in-memory storage or a RDBMS. FullTextStore Some analytical techniques for text-based content – such as keywords, named entities or related languages – require the existence of a full text. The texts can be extracted by a Parser and will be stored in the FullTextStore. The FullTextStore will notify components in the analysis block about the existence of a full text that can be retrieved via an event system. RelevanceStore The RelevanceStore holds information about the relevance of a keyword. This relevances are used during the calcu- lation of semantic similarity of artefacts. In our architec- ture the FullTextStore and the RelevanceStore were incor- porated in one component. We use the Apache Lucene en- gine to efficiently store the full texts and to query the Full- TextStore for the relevance of single keywords. SimilarityStore The SimilarityStore manages all calculated similarity val- ues between artefacts, actors, persons and groups. Sim- ilarity values are created by various similarity analyzers, which store their results in the SimilarityStore according to an analyzer-specific order. 3.3 Analysis block The Analysis block includes all components for analyzing stored data of components from the DataStore block. Com- ponents in this block only listen to events fired from Data- Store components. We distinguish between NetworkAnal- 4http://openjena.org/ 5http://www.w3.org/TR/rdf-sparql-query/ ysers, TextAnalysers and RelevanceAnalysers. Network- Analysers listen to events fired from the DataStore, Text- analysers listen to events from the FulltextStore and Rele- vanceAnalysers to events from the RelevanceStore. NetworkAnalyser NetworkAnalysers are using the data pool of the DataStore to analyse network structure. They will be informed to ev- ery change in the data pool of the DataStore. TextAnalyser TextAnalysers start working upon the reception of Full- TextStore events. Currently we implemented Or- chestr8Analyser, OpenCalaisAnalyser and CosineAnal- yser. The Orchestr8Analyser and OpenCalaisAnalyser are making use of the webservices of Orchestr8 and Open- Calais respectively to extract keywords and named entities from a given text. Extracted data will be stored as addi- tional parts of resources. RelevanceAnalyser RelevanceAnalysers are components to determine the simi- larity of resources. An example for such an algorithm could be the SemSim algorithm that we developed (see [Rein- hardt et al., 2009b] for more detail on the algorithm). Sem- Sim allows the calculation of semantical similarity based on the keywords and named entities stored for artefacts. The calculated values are then stored in the SimilarityS- tore. 3.4 Connection block The fourth block in our architecture is the Connection block that encapsulates Webservices and the visualization component. Both components expose data stored in Arte- fact-Actor-Networks or API functionalities to external con- sumers over HTTP interfaces. Webservices Internal server components such as the CrawlerManager, the CrawlerManager or the DataStore are providing their functionalities via webservice interfaces as an API to ex- ternal applications. Using the API, crawling jobs can be placed or data from the DataStore can be requested. 
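As an illustration of this storage and query path, the following minimal sketch uses the Jena API (as available around the time of the paper) to assert one AA relation and retrieve it via SPARQL. The namespace URI is a placeholder and does not correspond to the published AAN ontologies; only the isAuthor relation name is taken from the paper.

    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.query.*;

    // Minimal sketch of storing and querying AAN-style data with Jena.
    public class DataStoreSketch {
        static final String NS = "http://example.org/aan#";   // placeholder namespace

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            Property isAuthor = model.createProperty(NS, "isAuthor");

            Resource actor = model.createResource(NS + "actor/alice");
            Resource artefact = model.createResource(NS + "artefact/tweet42");
            actor.addProperty(isAuthor, artefact);   // one AA relation

            // Query the model with SPARQL, as the DataStore exposes it to frontends.
            String q = "PREFIX aan: <" + NS + "> "
                     + "SELECT ?artefact WHERE { ?actor aan:isAuthor ?artefact }";
            QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), model);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("artefact"));
                }
            } finally {
                qe.close();
            }
        }
    }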
The DataStore provides tailored API functions as well as a SPARQL interface to the AAN.

Visualization component
The visualization component converts the contents of Artefact-Actor-Networks into an XML-based graphics format that can be displayed and explored in the appropriate viewers. The preparation of those files is costly and would be a great burden for the AAN server. In order to reduce this effort we implement a caching strategy that only keeps those projects up to date that are being accessed regularly.

3.5 Technical note
The architecture is based on the OSGi service platform6, a component framework based on the Java platform. The OSGi service platform is a specification of the OSGi Alliance for which several implementations exist, amongst them Eclipse Equinox7 and Apache Felix8. These frameworks support the live deployment of components, which makes it possible to add a new Parser without restarting the server.
6http://www.osgi.org
7http://www.eclipse.org/equinox
8http://felix.apache.org
OSGi also allows the dynamic communication of components with each other. For this purpose, a service system is used that allows components to register services that can then be used by other components. The AAN architecture makes heavy use of this approach, as all components provide specialized services and await use. As an example, each Parser has to provide a service that the Crawler component accesses via the two methods isParsable and parse.

4 Social media tools under investigation
In the current implementation of AANs we store and analyze data from four different social media tools: (1) Twitter, (2) Delicious, (3) SlideShare and (4) MediaWiki. All of these tools are used by researchers during their daily work routines (see for example [Heinze et al., 2010] for an inspection of the tools used by researchers) and make specific demands on the respective components in the AAN architecture. In the following we present the specifics of the single components.

4.1 The specifics of the Twitter component
Twitter9 is a microblogging service which allows users to publish short messages with a length of under 140 characters. These messages are typically public and can be viewed over different channels. The TwitterParser component uses the responses of the TwitterAPI in XML or JSON format to parse the information of a single Tweet (Status), a Twitter user, a user's timeline, the followers of a user, or a search request. The TwitterParser has a dedicated component for each of these functions, which are described in the following. The StatusComponent parses the XML answer of the TwitterAPI /show/status request. It extracts the information about the status and the user who created it. The extracted information is kept in the DataStore as listed below: CreationTime the creation time of the status. Statusid the Twitter id of the status. Replyid the Twitter id of the status to which the current status is the reply. User the information about the creator of this status; this will use the UserComponent. WebURL the URL to this status as an HTML page. Hashtags the hashtags of this status. ExternalLinks hyperlinks that this status may contain. Text the full text of the status, which will be stored in the FullTextStore. TwitterAPI /show/user requests are processed in the UserComponent, which extracts information about the Twitter user and stores it in the DataStore. The extracted information is listed below: UserID the Twitter id of the user.
Screenname the screen name of the user. Username the real name of the user. Location the location where the user lives. Description the description of the user which was entered in Twitter. URL the internet address of the user. 9http://www.twitter.com/ CreationTime the date at which the user registered at Twitter. Last Status the last status of the user (will passed to the StatusComponent for analysis). The TwitterAPI response for a timeline contains a se- ries of statuses. This series will be separated by the Time- lineComponent into the single statuses. At the last step the TimelineComponent will forward each single status to the StatusComponent which will extract the information. The /show/followersid API call responds with a list of TwitteruserIDs which will be parsed and returned to the Crawler as follow-up links, which can be followed by a CrawlerManager to parse the user information of the fol- lowers. The response from the TwitterSearchAPI is computed the same way as followers. The StatusID will be extracted from the response and returned to the Crawler as follow- up links. This is necessary because Twitter uses a different data structure in the SearchAPI as in the rest of the Twitter- API. 4.2 The specifics of the Delicious component One of the integrated data sources is the social bookmark- ing service Delicious10. Delicious can be used to store per- sonal bookmarks on the web and share them with others. During the creation of bookmarks users have the opportu- nity to add notes and tags to describe and categorize their input. By adding this additional data, especially the tags, artefact-networks are created. On the one hand bookmarks of different users form networks by relations resulting from their tags. On the other hand all bookmarks of a user are connected to the user himself. Besides these artefact- networks, actor-networks can also be found at Delicious. These are formed while users add other Delicious users to their personal network. By these relations, some users are connected indirectly, as well as their bookmarks are con- nected additionally. In summary, Delicious provides both types of networks that form the base of Artefact-Actor- Networks. What is the most practicable way of proceeding to get it? Data access Delicious offers a huge amount of possibilities for develop- ers to access the available data11. Depending on the desired outcome, one can choose from different interfaces, e.g. an API, feeds or link-rolls. Generally, one of the most applied ways to access data is by the use of an API. We also tested this way for applicability to our system. The offered API- methods are custom-made for the access and use of a users personal data. A user can create, edit and receive personal bookmarks, tags and tag bundles. As the idea of AAN is the extraction and analysis of public data, and the need of a user-authentication by the API is a hindrance, there was a need for more useful data access. A more utilizable approach to get data is the use of De- licious feeds12. Feeds are offered in JSON and RSS for- mat and provide an access to public data. It is possible to get the latest bookmarks, tags and network members of a specified user. Furthermore, requests for bookmarks can be refined by combining a specific username and tags. More- over, recent bookmarks for an URL can also be accessed what forms an extensive base for information retrieval. 
10http://delicious.com/ 11http://delicious.com/help/tools 12http://delicious.com/help/feeds 328 As feeds are mainly used for receiving the latest informa- tion, most of the feeds are provided as a list of recent book- marks. Furthermore, the Delicious feeds are limited to 100 entries per request and as there is also a limitation of one re- quest per second we encountered a restriction for crawling the entire bookmarks of a user. This fact was partly solved by multiple recursive requests. If a user has described his bookmarks with the tags A, B and C, first the bookmarks described with tag A are requested. If the returned set of bookmarks amounts to 100 entires, a combined request of tag A and B is sent. If the result of this request amounts to 100 entries again, a refinement by an additional tag is used. Otherwise, the bookmarks of the tags A and C are requested to get a result that is as complete as possible. Within our system, the DeliciousParser is setting prop- erties for contents of specific feed calls. With these prop- erties, extracted tags and the count of returned artefact en- tries are stored. This forms the basis, by which the Deli- ciousCrawlerManager is generating feed URLs to follow up crawling a complete set of bookmarks. Finally, the received feed-data is mapped to the defined ontology and added to the AAN model. 4.3 The specifics of the SlideShare component SlideShare13 is a Web 2.0 platform which offers users the possibility to share presentations and documents. A user can upload files in PDF and common office formats and is able to define metadata like tags, category and visibility information. Published slides can be favored, rated, com- mented on, downloaded and shared with others. For developers, SlideShare provides an API14 with which public and private data can be accessed. Public API methods require an optional user authorization. By using public methods, developers can request documents related to users or tags. Additionally, a user’s tags or contacts can be requested as well. For accessing data of the SlideShare network, the ex- pandability of the AAN framework is used. Here we took advantage of the clear URL scheme of the API methods and thus the existing components CrawlerManager and Zentral- CrawlerManager work together with the SlideShareParser, a specialized parser to analyze the incoming SlideShare data. New crawling tasks are added to one of the Crawler- Managers. If such a task consists of a SlideShare URI, the specialized SlideShareParser determines that it is able to handle the given input. If the parser is chosen it firstly analyzes the URI scheme. In the following, artefacts, ac- tors and metadata are extracted and stored accordingly to the specified ontology. Finally, API URLs of related ac- tors, artefacts and keywords are generated and added to the crawling queue. 4.4 The specifics of the MediaWiki component The MediaWiki component was designed to receive as much information as possible from a MediaWiki installa- tions such as WikiPedia. First we designed a specialized MediaWikiCrawlerManager, which is able to control the crawling process to crawl and observe a complete Medi- aWiki or just a single page. Furthermore, we implemented the MediaWikiParser, who’s task it is to parse input that was received by the Crawler (see Section 3.1). 
13http://www.slideshare.net/
14http://www.slideshare.net/developers/documentation/
The MediaWikiParser stores extracted information such as internal and external links or information about the author in the DataStore component. The full text is stored with the FullTextStore component. We distinguish between three different types of jobs handled by the MediaWikiCrawlerManager, which are discussed below.

Crawling single pages
In this case, the MediaWikiCrawlerManager handles the job in four steps. First, a unique URL in the form of a permalink15 is generated. As discussed before, every object, such as a MediaWiki article, is represented as a unique artefact in the Artefact-Actor-Network. Secondly, the MediaWikiCrawlerManager generates a MediaWiki API query of the type 'parse'. The required information is initially received from the MediaWiki server of the article. Note that this query is not executed by the MediaWikiCrawlerManager; only the appropriate URL is created. In the third step the MediaWikiCrawlerManager calls the Crawler to add a new crawl task. The resource is then fetched by one of the Accessor components. If the MediaWikiParser is registered and started, it parses the resource, owing to the distinction between special and general parsers.
15Permalink is the unique URL of a MediaWiki page; e.g. http://en.wikipedia.org/w/index.php?title=Java&oldid=366545644

Further properties
When adding a new job to crawl only a single page, one is able to define the properties for resolving internal and external links by specifying a depth value. If the depth for internal links is 1, for example, the MediaWikiCrawlerManager will add new jobs for all internal links received.

Crawling a full MediaWiki
The MediaWikiCrawler can crawl a complete MediaWiki with all its articles in the latest revision, or with all its articles in a specifiable number of revisions. When a job to crawl a complete MediaWiki is added, the MediaWikiCrawlerManager executes an API query to receive a list of all articles, including the latest revision information for each article. If one wants to get more than the current revision of an article, the MediaWikiCrawlerManager executes queries to obtain the basic information needed about each revision. With this information, the MediaWikiCrawlerManager finally generates MediaWiki API queries of the type parse to add new single-page jobs. Each of the generated jobs is handled as described in the section about crawling single pages. Another important note concerns a limitation of the MediaWiki API: only 500 article or revision entries can be received with one query. This is solved by executing serial queries.

Further properties
For this job type the properties revision count and the depth of external links can be specified. The property revision count can be a positive integer value or -1 to crawl all revisions of all pages. The parameter for the depth of the external links is the same as described in the section about crawling single pages. It is propagated to each created single-page job.

Observing a MediaWiki
To keep AAN data up to date, the MediaWikiCrawlerManager supports the observation of a MediaWiki. After crawling a full MediaWiki initially, it regularly checks the MediaWiki for changes. This means that it is not necessary to crawl and parse the complete MediaWiki every time, which would consume too much time. Only changes since the last successful crawl are considered in a new crawling job.
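The following sketch illustrates the kind of API URLs such crawl jobs could generate. The parameter names follow the public MediaWiki web API (action=parse, list=allpages, prop=revisions), while the base URL, the helper class and its method names are assumptions made for this example rather than details taken from the paper.

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    // Illustrative sketch of the API URLs a MediaWikiCrawlerManager could generate.
    public class MediaWikiQueries {
        private final String apiBase; // e.g. "http://en.wikipedia.org/w/api.php" (assumed)

        public MediaWikiQueries(String apiBase) {
            this.apiBase = apiBase;
        }

        // 'parse' query for a single-page job (the URL is only built here, not executed).
        public String parseQuery(String pageTitle) throws UnsupportedEncodingException {
            return apiBase + "?action=parse&format=xml&page="
                    + URLEncoder.encode(pageTitle, "UTF-8");
        }

        // List of all articles for a full-wiki job; at most 500 entries are returned
        // per request, so the listing has to be continued with serial queries.
        public String allPagesQuery(String continueFrom) throws UnsupportedEncodingException {
            String url = apiBase + "?action=query&format=xml&list=allpages&aplimit=500";
            if (continueFrom != null) {
                url += "&apfrom=" + URLEncoder.encode(continueFrom, "UTF-8");
            }
            return url;
        }

        // Revision information for a page, e.g. to crawl more than the latest revision.
        public String revisionsQuery(String pageTitle, int revisionCount)
                throws UnsupportedEncodingException {
            // revisionCount == -1 means "all revisions"; 500 is the per-request limit.
            int limit = (revisionCount == -1) ? 500 : Math.min(revisionCount, 500);
            return apiBase + "?action=query&format=xml&prop=revisions&rvlimit=" + limit
                    + "&titles=" + URLEncoder.encode(pageTitle, "UTF-8");
        }
    }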
General handling of jobs All generated jobs will be stored in a threaded queue by scheduling first-come-first-serve. Each job is represented as a single thread, which allows to handle more than one job in parallel. 5 Outlook and further R&D opportunities Artefact-Actor-Networks are analyzing interactions of learners with artefacts that are used for individual and or- ganisational learning. The semantically enriched data is then exposes via an open API to be included in various user interfaces. In [Reinhardt, 2010] we introduce the AANalyzer as the first awareness dashboard that build on the AAN model and will be applied to several learning communities in the course of the year 2010, which will help us to gain user feedback on the awareness support the tool offers. At the same time we are extending both the number of social me- dia tools available for analysis in AANs as well as the qual- ity of the AAN backend implementation. First functional tests with more than 400.000 nodes in an Artefact-Actor- Network revealed the need for improvements regarding the inferring of semantical data stored. Besides the long run- times to calculate the inferred models, any changes in the data model require a rebuilding of the inferred model. Fur- thermore, the calculation of semantic similarity between artefacts, actors and groups in AANs is a challenging en- deavor to overcome. At the moment the calculations are done in an online algorithm whose runtime is exponen- tial to the number of artefacts in the DataStore. We strive for implementing an offline algorithm that makes use of caching strategies and a the data in the SimilarityStore. Regarding the variety of awareness widgets for the users of the AANalyzer, we will extend the choice with statistics widgets and an advanced word cloud implementation that will allow for the visualisation of timely changes in the im- portance and use of certain terms. References [DCMI, 2010] DCMI. Dublin Core Metadata Initiative. http://dublincore.org/, 2010. [Ebner and Schiefner, 2008] Martin Ebner and Mandy Schiefner. Microblogging - more than fun? In Proceed- ings of the IADIS Mobile Learning Conference, pages 155–159, 2008. [Ellison et al., 2007] N.B. Ellison, C. Steinfield, and C. Lampe. The benefits of Facebook” friends:” social capital and college students’ use of online social net- work sites. Journal of Computer Mediated Communica- tion (Electronic Edition), 12(4):1143, 2007. [FOAF, 2010] FOAF. The Friend of a Friend (FOAF) project. http://www.foaf-project.org/, 2010. [Granovetter, 1983] M. Granovetter. The strength of weak ties: A network theory revisited. Sociological theory, 1:201–233, 1983. [Gruber, 1993] T.R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisi- tion, 5(2):199–220, 1993. [Heinze et al., 2010] N. Heinze, P. Bauer, U. Hofmann, and J. Ehle. Kollaboration und Kooperation in verteil- ten Forschungsnetzwerken durch Web-basierte Medien – Web 2.0 Tools in der Wissenschaft. In Forthcoming Proceedings of the GMW 2010 conference, 2010. [Java et al., 2007] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter: Understanding mi- croblogging usage and communities. In Procedings of the Joint 9th WEBKDD and 1st SNA-KDD Workshop 2007, August 2007. [Lohmann and Riechert, 2010] S. Lohmann and T. Riechert. Adding Semantics to Social Software Engineering: (Re-)Using Ontologies in a Community- oriented Requirements Engineering Environment. In Workshop-Proceedings of Software Engineering 2010, pages 485–494, 2010. 
[Maalej et al., 2008] W. Maalej, D. Panagiotou, and H.-J. Happel. Towards Effective Management of Software Knowledge Exploiting the Semantic Web Paradigm. In Proceedings of Software Engineering 2008, pages 183– 197, 2008. [O’Reilly, 2005] T. O’Reilly. What is Web 2.0 – Design Patterns and Business Models for the Next Genera- tion of Software. http://oreilly.com/pub/ a/web2/archive/what-is-web-20.html, September 2005. [Reinhardt et al., 2009a] Wolfgang Reinhardt, Martin Ebner, Guenter Beham, and Cristina Costa. How people are using Twitter during conferences. In V. Hornung- Pra¨hauser and M. Luckmann, editors, Creativity and Innovation Competencies on the Web. Proceedings of the 5th EduMedia 2009, Salzburg, pages 145–156, 2009. [Reinhardt et al., 2009b] Wolfgang Reinhardt, Matthias Moi, and Tobias Varlemann. Artefact-Actor-Networks as tie between social networks and artefact networks. In Proceedings of the 5th International ICST Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2009), November 2009. [Reinhardt, 2010] Wolfgang Reinhardt. A widget-based dashboard approach for awareness and reflection in online learning communities based on Artefact-Actor- Networks. In Forthcoming Proceedings of the First PLE Conference 2010, 2010. [SIOC, 2010] SIOC. The semantically-interlinked online communities (sioc) project. http: //sioc-project.org/, 2010. [Vuorikari, 2009] Riina Vuorikari. Tags and self- organisation: a metadata ecology for learning re- sources in a multilingual context. PhD thesis, Open Uni- versity of the Netherlands, 2009. 330 Abstract Recommendation Systems are central in current applications to help the user find useful informa- tion spread in large amounts of post, videos or social networks. Most Recommendation Systems are more effective when huge amounts of user data are available in order to calculate similari- ties between users. Educational applications are not popular enough in order to generate large amount of data. In this context, rule-based Rec- ommendation Systems are a better solution. Rules are in most cases written a priori by do- main experts; they can offer good recommenda- tions with even no application of usage informa- tion. However large rule-sets are hard to main- tain, reengineer and adapt to user goals and pre- ferences. Meta-rules, rules that generate rules, can generalize a rule-set providing bases for adaptation, reengineering and on the fly genera- tion. In this paper, the authors expose the bene- fits of meta-rules implemented as part of a meta- rule based Recommendation System. This is an effective solution to provide a personalized rec- ommendation to the learner, and constitutes a new approach in rule-based Recommendation Systems. 1 Introduction Nowadays, when the amount of information is becoming over-exceeding, Recommendation Systems emerge as the solution to find the small piece of gold in mountains of garbage. In electronic commerce, knowledge management systems, social networks, and other fields and markets, they help users find useful products, lessons or contribu- tions. There are many inputs which can be used as infor- mation sources like i.e. user similarities with other users, user profile, and user preferences. All these inputs provide the system with valuable data to suggest the user the best way to follow or the most appropriate choice. Further- more, people ratings [Rocchio, 1971] are another impor- tant source of information for Recommendation Systems. Other sources of information are i.e. 
user interests, goals, and objectives, all of them more useful for educa- tional applications. However, educational applications lack of enough amounts of data to establish user similari- ties in a precise way. In this case, recommendations are based on information stored in a user model which is ex- tended explicitly or implicitly. There are also hybrid ap- proaches which ask some minimum information to the user and the rest is obtained in an implicit way. For educational applications rule-based Recommenda- tion Systems have proved as more useful than other sys- tems [Abel et al., 2008]. In general acceptable recommen- dations can be obtained with a small amount of informa- tion. However, when the system achieves a better know- ledge of the user, recommendations increase precision since rules evolve in parallel or new ones are included to the rule-set. In general expressing user preferences, goals and interest with rules can be difficult [Anderson et al., 2003] to solve this problem complex and large rule-sets are generated. This solution carries another problem: the size and complexity of the rule-set can be unaffordable. In addition, it would be desirable to generate rules based on data extracted from a database or from the user, increasing this way user adaptability. Such generation could also be on the fly, allowing the rule-set to be up to date. In this paper we propose a solution to this problem by the introduction of a new abstraction level: meta-rules, which provide foundations for effective adaptive and per- sonalized processes, such as in, e.g. learning. This ap- proach has been implemented in the context of Meta- Mender, our meta-rule-based Recommendation System. In this paper, we also describe the Meta-Mender architecture and implementation to contextualize the meta-rules ap- proach. 2 Background and Related Work In the field of educational recommendations there are ap- proaches like [El Helou et al., 2009] which use a modified page ranking algorithm for the generation of recommen- dations. The algorithm considers actors, activities and resources as main entities. Here user’ activity is used to create a directed graph representing the entities and links between them are generated. Later a rank is assigned to nodes and this information is used to generate recommen- dations for a user query. This work is valuable for us be- cause in general meta-rules use as input the user’s activity for the automatic generation of rules. This input comes as in the referenced paper from actors, activities and re- sources. The main difference is that in our approach it will be generated a set of rules instead of a directed graph. In the field of rule-based recommenders there are some relevant reports in the literature [Abel et al., 2008], for example, describes a rule based Recommendation System for online discussion forums for the educational online board Comtella-D. Actually the system is able to call sev- eral encapsulated recommenders, collaborative filtering or Meta-rules: Improving Adaptation in Recommendation Systems Vicente Arturo Romero Zaldivar 1 and Daniel Burgos 1,2 1 Atos Origin SAE, Albarracin 25, Madrid 28037, Spain {vicente.romero, daniel.burgos}@atosresearch.eu www.atosresearch.eu 2 International University of La Rioja Gran Via Rey Juan Carlos I 41, 26002 Logroño, La Rioja, Spain www.unir.net 331 content-based recommenders, and the rules decide accord- ing to the amount and type of user data which recom- mender should be called. 
In doing so, the rules define a meta-recommender which is very interesting but tangen- tial to this work. An advantage of rule-based recommenders against oth- er approaches is how easy it is to generate explanations for these systems. In many cases, it is almost impossible to explain to the user how a Recommendation System has derived a conclusion. If the user is not sure of a given recommendation and-or prefers to receive some logic- supported arguments which help him to choose a given solution, rules-based systems are the best ones prepared to solve this issue. This problem is usual in automated colla- borative filtering systems [Herlocker, 1999], where the lack of explanations decreases the system acceptance and affects user trust. RACOFI [Anderson et al., 2003] is defined by its au- thors as a rule-applying collaborative filtering system. This system is a hybrid conformed by a collaborative fil- tering recommender plus a rule-based recommender. It was designed for recommending Canadian music but its authors argue that the system is content independent. RACOFI uses rules to modify, for example, ratings of items based on item similarity. This means that if a user rates an album as highly original, other albums’ originali- ty of the same author will be incremented. This paper is very useful for us because it contains a large set or rules which will be used to prove the synthesis power of a me- ta-rule approach. With regards to meta-rules, we have not found anything similar to our proposal. Initially used by LISP [McCarthy et al., 1985] and other languages, the closest concept is rule templates, followed by Open Rules 1 and the Object Oriented RuleML 2 approach of handling rules as data, thus generating entire rules from its component parts. These approaches are very valuable. However, our ap- proach of producing rules using imperative programming comprises these two approaches and, at the same time, it seems to be more powerful. For instance, Meta-Mender can easily generate meta-rules, or other supplementary features, easily. In addition, as it will be shown in the next sections, using meta-rules in Rule-based Recommender Systems is not common, and it provides other benefits like maintainability, reengineering and on-the-fly generation. Finally, the introduction of another level of abstraction is very valuable for adaptation and performance. For the implementation of a rule-based recommender using a rule-management system can be of great help. In this respect, Meta-Mender makes use of DROOLS [DROOLS] as rule engine. DROOLS is a business rule management system (BRMS) with a forward chaining inference-based rules engine (the so-called rule system). This system makes use of an enhanced implementation of the Rete algorithm [Forgy, 1982]. DROOLS is designed to allow plug-able language implementations. Currently, rules can be written in Java 3 , MVEL 4 , Python 5 , and Groovy 6 . It is also possible to write functions to be ex- ecuted as the consequence of any rule; this feature has 1 http://openrules.com/index.htm 2 http://ruleml.org/indoo/indoo.html 3 http://www.sun.com/java/ 4 http://mvel.codehaus.org/ 5 http://www.python.org/ 6 http://groovy.codehaus.org/ been used to generate rules from meta-rules. It is also possible to assign a priority to rules, which is a way to address the execution order by the rule designer. 3 The Meta-Mender Architecture and Im- plementation As aforementioned, the Meta-Mender Recommendation System uses DROOLS as the rule engine. 
This rule sys- tem works as follows, first it is required to feed the engine with a set of rules (or meta-rules in this case); these meta- rules are defined by the professor or by the technical team using the application requirements. To define meta-rules is a more complex task than to define rules. In this re- spect, future research will be done in order to generate meta-rules automatically. A meta-rule example will be shown in the next subsection. Later some facts should be added to the engine. These facts are extracted commonly from a database, be it relational or ontological. As the facts are inserted, rules antecedents are checked for com- pleteness and once the engine is started, rules that fulfill its antecedents are fired and the corresponding conse- quences are executed. The order of execution is arbitrary, so rule priority must be stated if the order of execution is important for the final result. In our case the order is im- portant because the output of this iteration will be a file of rules and these files have a structure, so the rules that gen- erate the header must be executed before the rules that generate the body. The output is used for a second itera- tion from which the final recommendations are obtained. See Figure 1 for a simplified representation of the recom- mender architecture and Figure 2 for an example of a me- ta-rules file, this file can generate rules for recommending the next course to follow in a .LRN educational applica- tion. See that at the consequence of the meta-rules a func- tion is called, this function receives some information that allows it to write the rules wanted. Figure 1. The Meta-Mender Architecture It is worth mentioning that in DROOLS, rule conditions cannot contain functions, this is a common issue in rule systems the behind reason is performance. The condition is formed by a conjunction of patterns that must be matched by the inference algorithm. These patterns cor- respond at the end with object instances. So rule condi- tions are in a way fixed and must be defined statically. The only way to adapt a rule-set to changing conditions is by generating rules dynamically. These changing condi- tions can be: • Changes in the application architecture, this includes the addition of a new forum or social network. 332 • Changes in membership conditions, access restric- tions, or similar. • Changes in model, new classes addition, modifica- tion of existing classes, etc. It is clear that a static defined condition cannot cope with all these changes. So if there is a way for generating rules according to current system conditions and proper- ties, it could save time and effort as long as increase adap- tation. See Figure 3 for an example of a rules file obtained from the meta-rules file of Figure 2. In this figure it can be noted that the recommendation list is created as rules con- sequences are executed. The function AddRecommen- dedCourse() builds this list and the parameter (10 - $courseLevel) allows giving more priority to those courses with a lesser level of complexity. This criterion can be personalized to every student. Meta-rules can handle changing conditions. Different rules can be generated depending on user data, system properties, etc.; also rules priority can be different which implies different recommendations. 
package service //global variables definitions global java.lang.String filePath; //rules rule Header //empty antecedent, executes always when then WriteHeader(filePath); end rule CourseSequenceRule salience -1 //priority when //antecedent //classId is an object field LRNClassData($classId : classId, $className : className) LRNStudentData($studentId : studentId, $studentName : studentName) LRNClassPerStudent(classId == $classId, studentId == $studentId) then WriteClassMembershipRule(filePath, $studentName, $className); end //function implementation function void WriteHeader( String filePath) { ... } function void WriteClassMembershipRule( String filePath, String studentName, String className) { ... } Figure 2. A Meta-rule File Rule output can be related with rule priority since it will be possible to order the results and give the user a list of recommendations ordered by priorities. 4 Practical Implementations of the Meta- Mender Recommendation System The Meta-Mender recommender is been used at present as part of two projects. The first one, TELMA 7 , is focused on the application of Communications and Information Technologies in lifelong training of surgeons of Minimal- ly Invasive Surgery (MIS). TELMA develops an online learning environment which manages the content and knowledge generated by users in an efficient way. TELMA creates a new training strategy based on knowledge management, cooperative work and communications and information technologies aiming to the improvement of the formation process of the surgeons of MIS. In order to cover all these needs and achieve its goals, TELMA develops a cooperative and adaptable learning platform. This platform is composed of a training system which integrates a tool for the authoring of didactic mul- timedia content, a Recommendation System and a social network. The Recommendation System, Meta-Mender, manages the knowledge generated by the users creating the foundations for adaptive learning. The learning plat- form allows the construction of a complete knowledge base thanks to the reuse and sharing of the knowledge generated for the professionals who access the learning system. package service global domain.RulesOutData outData; rule MathI when UserData($userName : name, $userId : id) Course($courseLevel : courseLevel, $courseName : name, id == 1) not (TotalCoursePercentage(userId == $userId, courseId == 1)) then outData.AddRecommendedCourse( $courseName, 10 - $courseLevel, $userName); end rule AlgebraI when UserData($userName : name, $userId : id) Course($courseLevel : courseLevel, $courseName : name, id == 2) not (TotalCoursePercentage( userId == $userId, courseId == 2)) exists (TotalCoursePercentage( userId == $userId, courseId == 1)) then outData.AddRecommendedCourse( $courseName, 10 - $courseLevel, $userName); end Figure 3. A Rules File Obtained from the Meta-rules File of Figure 2 The second project, GAME·TEL 8 , is focused on the creation of a system for the design, development, execu- tion and evaluation of educational games and simulations, 7 www.ines.org.es/telma 8 www.ines.org.es/gametel 333 adapted to student preferences, educational goals, profile, [Burgos et al., 2007]. The games and simulations are conversational adven- tures which are both usable and understandable. These characteristics benefits to the content creator, the profes- sor in most cases, as long as to the final user, usually the student [Moreno-Ger et al., 2008; Torrente et al., 2008]. 
The software system is composed of several intercon- nected modules which allow the integration and inter- communication between games and simulations and sev- eral tools widely used for communities of professors. In- itially these tools are the learning management systems Moodle and .LRN, and the authoring and learning units execution system LAMS [Burgos et al., 2006; Moreno- Ger et al., 2006]. In this context the Meta-Mender Rec- ommendation System will suggest to users the best learn- ing path according with their preferences, goals and objec- tives and will help with the game adaptation problem. As an example, the following meta-rule allows the generation of rules for the forums that the user has access to, see Fig- ure 4. rule ForumRecommendationRule salience -1 when TelmaUser($userId : userId, $userName : userName, $mainInterest: mainInterest) TelmaForum($forumId : forumId, $mainTopic : mainTopic) TelmaForumAccessPerStudent(forumId == $forumId, userId == $userId) then WriteForumRecommendationRules(filePath, $userId, $userName, $forumId, $mainTopic); end Figure 4. A Meta-rule from Telma Application This kind of rules allows for adaptation in case of the addition of a new forum to the application, a fact that can happen at any moment. modify(amount->"0.5"; comment->"Adjusting originality rating (by 0.5) for high ratings of other albums by this artist."; variable->originality; product->?item) :- rating(itemID->?item2; originality->"9.0"!?REST0), product(itemID->?item2; artist->?artist!?REST1), product(itemID->?item; artist->?artist!?REST2). Figure 5. A Modify Rule from the RACOFI System 5 Expressive Power of Meta-rules In this section we provide an example of the expressive- ness power of meta-rules. We expose how a large set of rules can be generated from some few meta-rules. tax(amount->"%15"; comment->"15 percent HST") :- location(nb). Figure 6. A Tax Rule from the RACOFI System To this extend, we lean on the set of rules published on [Anderson et al., 2003]. This rule-set contains 20 rules that modify user ratings based on item similarity. An ex- ample of these modify rules can be seen on Figure 5. Other set of rules are the tax rules, there are 20 tax rules in the referenced paper. See Figure 6 for an example of these rules. Finally the last set of rules is the so-called: NotOf- fered rules. A sample of these rules can be seen on Fig- ure 7. There are 12 such rules in the rule-set. NotOffered(itemID->?itemID) :- userLevel(beginner), product(itemID->?itemID; impression->?IMP!?REST), $lt(?IMP,7,true). Figure 7. A NotOffered Rule from the RACOFI Sys- tem So the RACOFI rule-set defines more than 50 rules. This number is not very high but its maintainability can consume a lot of time. Also it is very hard to confirm that the rule-set is consistent with current data, suppose that it is necessary to modify a tax or a rating, it would be neces- sary to traverse the affected rules to check that everything is correct. Also on the fly rule generation, in order to in- crease adaptation, is a feature not easily covered with a static rule-set. 
rule modifyMetarule
when
  RatingAmountPair($amount : amount, $rating : rating)
  ProductMetadata($metadata : metadata)
then
  WriteModifyRule(filePath, $metadata, $amount, $rating);
end

rule taxMetarule salience -1
when
  TaxData($amount : amount, $location : location)
then
  WriteTaxRule(filePath, $amount, $location);
end

rule notOfferedMetarule salience -2
when
  NotOfferedData($maximum : maximum, $metadata : metadata, $student : student, $userLevel : userLevel)
then
  WriteNotOfferedRule(filePath, $maximum, $metadata, $student, $userLevel);
end

Figure 8. A Meta-rule-set that Generates the RACOFI Rule-set

The meta-rule approach discussed in this paper and implemented in our Meta-Mender Recommendation System is very useful for solving the problems mentioned above. By identifying the common structure of the rules, it is possible to write a concise meta-rule-set able to generate any number of rules on the fly. This metadata-driven rule generation (meta-rules usually get metadata as input) helps to solve the consistency problem, because rules can be regenerated at any moment after a change in the metadata. Adaptation is also enhanced, because different rules can be generated depending on user goals, needs and interests. As a side effect, the development time of a rule-set is drastically reduced, since a large number of rules can be generated with only one meta-rule. As an example, a set of meta-rules able to generate the whole rule-set of the RACOFI system is shown in Figure 8.

function void WriteModifyRule(String filePath, String metadata, String amount, String rating) {
  FileWriter output = new FileWriter(filePath, true);
  MessageFormat mFormat = new MessageFormat("");
  output.write(mFormat.format(
    "modify(amount->\"{0}\";\n", new Object[]{amount}));
  output.write(mFormat.format(
    "  variable->{0};\n", new Object[]{metadata}));
  output.write("  product->?item)\n");
  output.write(" :-\n");
  output.write(mFormat.format(
    "  rating(itemID->?item2; {0}->\"{1}\"!?REST0),\n",
    new Object[]{metadata, rating}));
  output.write("  product(itemID->?item2; artist->?artist!?REST1),\n");
  output.write("  product(itemID->?item; artist->?artist!?REST2).\n");
  output.close();
}

Figure 9. The WriteModifyRule Function

Figure 9 shows the function that generates the modify rule-set of the RACOFI system. The rest of the functions referenced in Figure 8 are similar to this one.

6 Conclusions and Future Work

In this paper, we present a meta-rule based approach for rule generation. Meta-rules are rules that generate rules, and are able to generalize a rule-set, providing a basis for adaptation, reengineering and on-the-fly generation. The benefits of meta-rules have been exposed, and as an implementation example some details of Meta-Mender, a meta-rule based Recommendation System, have been presented. The Meta-Mender Recommendation System is at present a component of two educational applications in development and testing. The concept the Meta-Mender is based on, meta-rules for adaptation, starts a new branch in the recommendation field and broadens the scope for new solutions in the field. Meta-rules are an effective solution for providing personalized recommendations to the learner, and constitute a new approach in rule-based Recommendation Systems. Meta-rules constitute a new abstraction level, which provides foundations for effective adaptive and personalized processes, such as, e.g., learning. This abstraction level is also highly valuable for adaptation.
Rules can be different for different users or for the same user at differ- ent periods of time. Meta-Mender is a recommendation system that implements this approach. At present, we use Meta-Mender in two R&D projects. In both, the next step is an evaluation phase with real data from actual users (in fact, two different, separate target groups). These two evaluation processes will provide a first-hand feedback of the implementation of Meta- Mender. Out of these results, we will refine the engine and we will design a visual authoring tool for meta-rules. Acknowledgments The research presented in this paper has been partially supported by the following projects of the Plan Avanza, a Spanish, nationally funded R&D programme: FLEXO (www.ines.org.es/flexo, TSI-020301-2009-9), GAMETEL (www.ines.org.es/gametel, TSI-020110-2009-170), TELMA (www.ines.org.es/telma, TSI-020110-2009-85). References [Abel et al., 2008] F. Abel, I.I. Bittencourt, N. Henze, D. Krause and J. Vassileva. A Rule-Based Recommender System for Online Discussion Forums. Proceedings of the 5th International Conference on Adaptive Hyper- media and Adaptive Web-Based Systems, Hannover, Germany, 2008. [Anderson et al., 2003] Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath. RACOFI: A Rule-Applying Collaborative Filtering System. Proceedings of the IEEE/WIC COLA'03, Halifax, Canada, October, 2003. [Burgos et al., 2006] D. Burgos, C. Tattersall and R. Ko- per. Re-purposing existing generic games and simula- tions for e-learning. Special issue on Education and pedagogy with Learning objects and Learning designs. Computers in Human Behavior, 2006. [Burgos et al., 2007] D. Burgos, C. Tattersall and R. Ko- per. How to represent adaptation in eLearning with IMS Learning Design. Interactive Learning Environ- ments, 15(2), 161-170, 2007. [DROOLS] DROOLS. The Business Logic Integration Platform, http://www.jboss.org/drools. [El Helou et al., 2009] S. El Helou, C. Salzmann, S. Sire and D. Gillet. The 3A contextual ranking system: si- multaneously recommending actors, assets, and group activities. In Proceedings of the Third ACM Conference on Recommender Systems RecSys '09 ACM, New York, New York, USA, 373-376, 2009. [Forgy, 1982] C. Forgy. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem, Ar- tificial Intelligence, 19, 17–37, 1982. [Herlocker, 1999] J. L. Herlocker. Position Statement - Explanations in Recommender Systems. In Proceed- ings of the CHI' 99 Workshop, Pittsburgh, USA, 1999. [McCarthy et al., 1985] J. McCarthy, P. W. Abrahams, D. J. Edwards, T. P. Hart, and M. I. Levin. LISP 1.5 Programmer's Manual, MIT Press, 1985. 335 [Moreno-Ger et al., 2006] P. Moreno-Ger, I. Martínez- Ortiz, J. Sierra and B. Fernández-Manjón. A Descrip- tive Markup Approach to Facilitate the Production of e Learning Contents. In Proceedings of 6th International Conference on Advanced Learning Technologies (ICALT 2006), 19-21, Kerkrade, The Netherlands, (IEEE Computer Society), 2006. [Moreno-Ger et al., 2008] P. Moreno-Ger, D. Burgos, I. Martínez-Ortiz, J. Sierra and B. Fernández-Manjón. Educational Game Design for Online Education. Com- puters in Human Behavior 24(6), 2530–2540, 2008. [Rocchio, 1971] J.J. Rocchio. Relevance feedback in in- formation retrieval, in the SMART Retrieval System. Experiments in Automatic Document Processing, Englewood Cliffs, NJ. Prentice Hall, Inc., 313-323, 1971. [Torrente et al., 2008] J. Torrente, P. Moreno-Ger, and B. Fernández-Manjón. 
Learning Models for the Integra- tion of Adaptive Educational Games in Virtual Learn- ing Environments. In Proceedings of the 3rd Interna- tional Conference on E-learning and Games, Nanjing, China. Lecture Notes in Computer Science 5093, 463– 474, 2008. 336 Social IPTV: a Survey on User-Acceptance Daniel Schreiber Telecooperation Lab Osama Abboud, Sandra Kovacevic Multimedia Communications Lab TU Darmstadt lastname@{tk.informatik,kom,cs}.tu-darmstadt.de Andreas Hoefer, Thorsten Strufe Peer-to-Peer Networks Abstract Incorporating Social Networking and IPTV, the two arguably fastest growing and most accepted services on the Internet today, yields strong syn- ergies. However, it opens a very complex de- sign space of feature combinations. Selecting features, aiming at achieving the best user accep- tance, hence, is a difficult, yet vital task for the development of an integrated service. This pa- per presents the results of an initial user study conducted to gain a better understanding of the complex design space. It identifies classes of demanded, promising features and indicates that social features in IPTV services in general will be well accepted, even if they are quite immer- sive to the TV experience. The study was con- ducted with user groups from central Europe as well as with a group from Korea. 1 Introduction IPTV, i.e., broadcasting of multimedia streams using well understood internet protocols, is a fast spreading technol- ogy for distributing multimedia TV content to consumers. With the rapid growth of the number of broadband links to the home as well as the available bandwidth at mobile devices, the connectivity of the consumer is constantly in- creasing, which opens the opportunity for innovative new services combining multimedia streams with interactive features. Besides IPTV, online social networking (OSN) services like Facebook etc. have been some of the most popular, fastest growing internet services in the last years. Thus, we conjecture that one promising approach to develop an innovative online service is Social IPTV, the integration of OSN and IPTV functionality into one consolidated service. Integrating both, poses to be a difficult task, since their utilization properties are located at two different extremes of participation: Currently, the TV experience is character- ized as being passive. However, in order to benefit from social TV, users will have to interact with the content as well as to communicate and collaborate with other viewers. Hence, the selection of integrated services and applications is non-obvious, since it needs to gauge between the comfort of passive content consumption and active contribution. In this paper we report initial results of an initial survey we conducted, which corroborate our hypothesis: Users will indeed accept and appreciate Social IPTV services. Fur- thermore, the user study revealed which features of OSN and IPTV integration are most interesting to the user. 2 Acceptance of Social IPTV Features Although it may seem natural to integrate OSN features into IPTV – e.g., for automatic recommendations and adapting the TV program to the user at hand [Vildjiounaite et al., 2008]; for creating a social viewing experience [Nathan et al., 2008]; or for joint commenting [Cattelan et al., 2008] – we wanted to make sure that this actually corresponded with user needs. Also, we wanted to find out which OSN features would be most valuable to end-users. To clarify these questions, we performed a two staged user study. 
2.1 Step 1 - Focus Group Interview First, we conducted an in-depth interview with a focus group to gather ideas for potential Social IPTV services. Method The interview has been directed by one inter- viewer and minutes were both taken together at a white board and additionally by one observer of the interview. The focus group consisted of 20 technology savvy people with academic background. The interview has been con- ducted in three phases and took just under an hour. The first phase consisted of free brain storming, in which the group was encouraged to suggest any possible features, applica- tions or scenarios in the context of Social IPTV. All ideas and innovations were recorded on the white board in order of suggestion. The suggestions were ordered and grouped in the second phase of consolidation, in which the partici- pants were encouraged to extend or further detail some of the ideas as they were discussed. Finally, the participants were asked to roughly estimate innovation, degree of at- tractiveness, and technical feasibility of each item in the list in the last phase. Results The results of this interview were about 30 dis- tinct proposed applications. Several of these did not in- clude social aspects, but only relied on the additional flex- ibility of general IPTV, e.g., feedback, online database in- tegration, and targeted advertising. The applications that include social aspects can be grouped into five clusters. Content Recommendations Using information about what friends are currently watching or used to watch to provide viewing suggestions. Community Awareness Displaying information about who is watching the same program, who is watching TV at all to create a more social TV experience. Community Meta Content Comments and annotations of IPTV content, e.g., alternative opinios for news, an- notating glitches in movies. 337 End-2-End Communication Live communication with friends viewing the same content, e.g., text and video chat. Participatory IPTV Remixing of media streams, e.g., providing an alternative audio comment to a selected group of friends Social Applications Providing interactive applications to a community, e.g., polls, betting, or visual annotation. Discussion The results are grouped by immersiveness, i.e., adapting TV program recommendations based on the user’s profile in the OSN leaves the user completely passive – input is provided only implicitly by letting the system an- alyze the viewing behavior of the user and its friends in the OSN, thereby automatically creating a user model. Content recommendation based on viewer behavior and user mod- els has, e.g., been suggested by [Vildjiounaite et al., 2008]. Community awareness can be generated by ambient dis- plays, e.g., as proposed in [Harboe et al., 2008] or buddy list [Boertjes, 2007]. Thereby, it is important that the real- isation is adapted to the user at hand, her personal prefer- ences as well as the current viewing context. Community generated meta content is extremely popular in OSN, where users can, e.g., comment on other users, or take part and comment on virtual events. Such features for IPTV have been proposed in [Nathan et al., 2008]. Here, the content from the community is interwoven with the IPTV content and users have to explicitly contribute this content. Even more demanding in terms of user attention is live communication, e.g., by chat [Abreu et al., 2001], voice chat [Coppens et al., 2004], or video chat [Abreu et al., 2001]. 
The next two clusters represent even more immersive acts for the user. Instead of just using features provided by the system, the user becomes a content developer, i.e., providing his own IPTV content through live-streaming [Stickam Social Video Streaming, ], or content editing [Cattelan et al., 2008], or even providing applications, leveraging end-user development. This analysis shows that most of the applications have already been envisioned by existing projects. However, we are not aware of a systematic assessment whether which degree of immersiveness would actually be tolerated by the users, e.g., whether the added benefit of the applications would be worth the disturbance. 2.2 Step 2 - Questionnaire Survey As a next step, the acceptance of several Social IPTV ap- plications in an European and Korean audience has been tested in a survey. Three application scenarios from the focus group interview and a general scenario of offering a “Social Electronic Program Guide”, representing the Con- tent Recommendation cluster, have been selected for the survey. Participants were asked to state their feeling towards the importance of the problem presented in the scenario as well as to rate the envisioned solution. The four se- lected scenarios were the “SportsPub” (allowing a virtual crowd to watch a sports event together, connecting specta- tors via video and text chat), shared highlighting (collabo- rative, shared highlighting of areas on the screen, e.g., to bet where the next goal is scored), the “PlayerStats” (the possibility to add annotations to certain objects on screen), and the “Social Electronic Program Guide”. All scenar- ios follow the line of thought of offering a collaborative After finishing watching the news, Sue can’t decide what program to watch next. She activates the Social Elec- tronic Program Guide to see what her friends are watch- ing. Her friend Mick is following the news on the elec- tion day in “Marx’ Corner”, a group that Engin, an- other friend, has set up. Other friends are watching sports together and some have tuned into Casablanca... Figure 1: Stimulus for the “Social Electronic Program Guide” scenario. TV experience, including the possibility to interact in tele presence as well as allowing for both textual and graphi- cal annotation. The notion of allowing users to participate by creating their own applications has been woven into the “PlayerStats” scenario. Participants The questionnaire was administered to three different groups of participants: Group A - audience of the World Usability Day (n = 32, 4 female, 1 unstated); group B - students in a university course (n = 44, 26, 0); group C - ETRI1 staff (n = 17, 2, 1). User groups A and C are technology affine, while group B is more reluctant about new technology. Users in groups A and B are Euro- pean, users in group C are from Korea. The age of the users ranged from 15 over 60 years (group A: 1 person <20y, 18 persons: 20 - 29y, 10: 30 - 39y, 1: 40 - 49y, 1: 50 - 59y, 1 unstated, group B 2 <20y, 40: 20 - 29y, 1: 40 - 49y, 1: >59, group C: 1 <20y, 6: 20 - 29y, 9: 30 - 39y, 2: 40 - 49y). The amount of TV consumption differed between the three groups. In group A, only 40% stated they were watching TV more than 4h per week. For the other groups the values were 82% (B) and 61% (C). We believe that the reason for this is that the users in group A spent significantly longer time surfing the web compared to the other groups, we did not test this though. 
47% - 52% in each group stated that they talked about what they watched on TV with their friends. Method In the questionnaire, each scenario was pre- sented as a combination of an illustrating picture together with a short, illustrative description of an exemplary situa- tion, in which the application would be used (cmp. fig. 1, and 2). The participants were confronted with the same set of questions for each scenario. 1Electronics and Telecommunications Research Institute, a Korean research institute 338 The questions were grouped in three parts: two questions to generally test if the participants had completely read and understood the scenario, two questions to test the partici- pants’ valuation and degree of esteem for the scenario, and an open question for comments or further ideas. The closed questions were designed offering four possible answers as a four-level Guttman Scale in order to avoid neutral answers. Following the four scenarios the participants had the oppor- tunity to provide further comments or ideas on the topic. The questionnaire ends with a set of statistic questions for the sample description. Even viewers can create new applications and share them with their friends: Lyn created a small applica- tion to tag objects in the show with some comments. She is a big fan of her team and especially likes Pepitto, the youngest player ever to play in the world cup! She knows all about him and shares her knowledge with the other viewers of Eiko’s SportsPub. Figure 2: Stimulus for the “PlayerStats” scenario. Results All four scenarios and applications were gener- ally perceived positively by all participants, in general all promise to make watching TV more fun. The three user groups had some characteristic differences, which can be seen in figure 3. Group B was less enthusiastic about the scenarios and solutions but thought that watching TV would be more fun and more interesting with the proposed enhancements. Group C was very enthusiastic but was more conservative in the rating of whether watching TV would be more fun and more interesting. Group A takes a middle position compared to groups B and C. A major concern consistenly mentioned in the responses to the open questions was that the system needs to preserve the user’s privacy and there should be the possiblity to mute social features in order to be able to watch certain shows without distraction. Summarising, the idea of providing an interactive, So- cial IPTV service has been accepted by our study sample, however users are not overly enthusiastic. Likewise, the three application scenarios have found some resonance in the study groups, and we think it makes sense to further in- vestigate what users liked about them and how they can be improved. 3 Discussion Oksman et al. have performed a study, which corresponds to a single aspect of our work: users in [Oksman and oth- ers, 2009] where asked, if they were interested in chatting with each other while watching TV. Our fundamental re- sults are supported by their conclusion that users consider watching TV as an inherently social activity. The study ad- ditionally backs the requirement of our users that privacy and the possibility to disable interactive features is vital for the acceptance of a social IPTV service. Given the evidence in existing studies on social TV and related topics (e.g., [Oksman and others, 2009] and [Har- boe et al., 2008]), as well as our own results it seems clear that social TV applications will be accepted by the user. 
This poses the question of how such applications should be built, i.e., which infrastructure is necessary to support such applications at runtime.

The most important question in our opinion is how to generate the critical mass that is necessary to support social TV features. The value of social applications to the user increases with the total number of users taking part in the OSN. Ramping up on the millions of users of OSNs like Facebook or StudiVZ would make OSN applications highly valuable right from the start. To foster such applications, extensions to existing, well-established IPTV platforms are needed. Existing solutions, e.g. the MHP API [Piesing, 2006], currently do not foresee dedicated support for integrating OSN.

The drawback of current OSNs is that all information (profiles, friend relations) is stored on the servers of a provider. This introduces a severe security risk, which has received much attention recently [Bilge et al., 2009]. Privacy concerns are relatively important to IPTV users. However, IPTV systems incorporating set-top boxes installed at the user's home open up the opportunity to create a new kind of online social network based on peer-to-peer principles that does not come with these security limitations [Cutillo et al., 2009]. These set-top boxes' online times are longer and more predictable (i.e., until the end of the current show) than those of nodes in traditional peer-to-peer networks. Therefore, one might have the privacy advantage of a peer-to-peer based OSN without the low availability that current research peer-to-peer OSNs suffer from. Another option would therefore be to use IPTV technology to create a new OSN of this kind. Thus, the valuable data from the OSN can be used to enhance the TV experience without putting the user's privacy at risk. An additional benefit is that people who are not technology savvy are much more likely to join these social networks, which provides a much richer data basis for recommender systems.

Inevitably, social TV will be interactive, i.e., users need means to provide community content, to chat with each other and even to do online remixing of streams. Thereby, the interaction with the TV should scale easily from a completely passive mode, where the user just consumes content; over a semi-interactive mode, where the user can, e.g., vote with the help of her TV remote; up to a highly interactive mode, where sophisticated input devices are used. It must be easy to switch between these different modes, requiring a much more flexible setup for in- and output devices than with current IPTV platforms. In this regard, we see a need for adaptation of the interface to the user and her needs. One approach is the use of stereotypes that allow the user to change between different levels of immersiveness, e.g., ranging from a completely passive traditional TV experience to a completely interactive experience, which is currently known from the web used on normal PCs.

(a) Group A (b) Group B (c) Group C
Figure 3: Results from the questionnaire for the three user groups. The median is marked by the bold line, the interquartile range with the box, max and min values by the whiskers. The first data point in each chart refers to how interesting the idea of the scenario was rated; the second data point refers to how well the solution was appreciated; data points three and four relate to the answers whether the solution would make watching TV more fun and more interesting.
4 Summary and Outlook

This paper presented the results of a user study on the acceptance of Social IPTV, a combination of social network and IPTV features. It introduced some Social IPTV scenarios, which were generated from interviews with a focus group. A subsequently performed survey supports the hypothesis that Social IPTV applications will be accepted by a central European as well as a Korean audience. Even immersive features, like content creation and end-user development, were received very well, and the results show that they will potentially be used. This is a strong argument for developing and providing the necessary infrastructure to implement such applications, since current platforms lack dedicated support for implementing Social IPTV applications. Although the study confirms that there is a potential benefit from integrating OSN features with IPTV, it did not reveal in much detail what the most interesting features would be. The focus interview resulted in an initial feature set. However, we are currently conducting further studies with working prototypes to find out which features or classes of features are accepted by the end-user and, even more important, why these features are accepted. The outstanding difference between traditional TV and Social IPTV is the move from a passive TV experience to a more immersive experience. Users expressed the wish to be able to return to the completely passive mode. We think this can be generalized, and the final Social IPTV system should allow for a transition from completely passive to fully immersive interfaces. To implement this, we want to identify certain user stereotypes to which we can adapt the interface.

Acknowledgements

We thank the Korean ETRI for supporting our research.

References

[Abreu et al., 2001] Jorge Abreu, Pedro Almeida, and Vasco Branco. 2beon - interactive television supporting interpersonal communication. In Multimedia 2001, Proceedings of the Eurographics Workshop. Springer, 2001.
[Bilge et al., 2009] Leyla Bilge, Thorsten Strufe, Davide Balzarotti, and Engin Kirda. All Your Contacts Are Belong to Us: Automated Identity Theft Attacks on Social Networks. In 18th Intl. World Wide Web Conference (WWW'09), pages 551–560, 2009.
[Boertjes, 2007] Erik Boertjes. Connectv: Share the experience. In EuroITV - Workshop: Social Interactive Television, 2007.
[Cattelan et al., 2008] Renan G. Cattelan, Cesar Teixeira, Rudinei Goularte, and Maria Da Graça C. Pimentel. Watch-and-comment as a paradigm toward ubiquitous interactive video editing. ACM Trans. Multimedia Comput. Commun. Appl., 4(4):1–24, 2008.
[Coppens et al., 2004] Toon Coppens, Lieven Trappeniers, and Marc Godon. Amigotv: towards a social tv experience. In EuroITV, 2004.
[Cutillo et al., 2009] Leucio-Antonio Cutillo, Refik Molva, and Thorsten Strufe. Safebook: a Privacy Preserving Online Social Network Leveraging on Real-Life Trust. IEEE Communications Magazine, December 2009.
[Harboe et al., 2008] Gunnar Harboe, Crysta J. Metcalf, Frank Bentley, Joe Tullio, Noel Massey, and Guy Romano. Ambient social tv: drawing people into a shared experience. In Proceedings of CHI, pages 1–10, New York, NY, USA, 2008. ACM.
[Nathan et al., 2008] Mukesh Nathan, Chris Harrison, Svetlana Yarosh, Loren Terveen, Larry Stead, and Brian Amento. Collaboratv: making television viewing social again. In UXTV, pages 85–94, New York, NY, USA, 2008. ACM.
[Oksman and others, 2009] Virpi Oksman et al. Tv is just one of the screens at home.
consumers and changing tv watching. In EuroITV, 2009. [Piesing, 2006] J. Piesing. The dvb multimedia home plat- form (mhp) and related specifications. In Proceedings of the IEEE, volume 94, pages 237–247, 2006. [Stickam Social Video Streaming, ] Stickam Social Video Streaming. Online: http://www.stickam.com/. [Vildjiounaite et al., 2008] Elena Vildjiounaite, Vesa Kyllo¨nen, Tero Hannula, and Petteri Alahuhta. Unob- trusive dynamic modelling of tv program preferences in a household. In EuroITV, pages 82–91, 2008. 340 Using a Semantic Multidimensional Approach to Create a Contextual Recommender System Abdulbaki Uzun Service-centric Networking Deutsche Telekom Laboratories, TU Berlin abdulbaki.uzun@telekom.de Christian Ra¨ck Competence Center FAME Fraunhofer Institute FOKUS christian.raeck@fokus.fraunhofer.de Abstract Item recommendations calculated by recom- mender systems mostly in use today, only rely on item content description, user feedback and profile information. In modern mobile services, however, contextual information and semantic knowledge can play a significant role concern- ing the quality of these recommendations. There- fore, the SMART Recommendations Engine of Fraunhofer FOKUS is extended by the SMART Multidimensionality Extension and the SMART Ontology Extension that enable the recommender to incorporate contextual and semantic data into the recommendation process. The demonstra- tion of the SMART Ontology Extension visual- izes that the preciseness of recommendations can be increased by exploiting implicit and indirect knowledge, classification and location informa- tion gained from ontologies when generating rec- ommendations in the scope of an exemplary food purchase scenario. 1 Introduction Today, people are confronted with a large amount of infor- mation in the World Wide Web [Shenk, 1998]. Receiving a small subset of desired and filtered content through stan- dard search engines turns out to be very difficult. And it becomes quite impossible, if user specific needs and inter- ests should be taken into consideration. Recommender systems handle this issue by filtering rel- evant information and providing personalized content rec- ommendations to users based on their profile and feedback. Numerous recommendation methods were designed over the years to improve the accuracy of recommendations. The most popular ones are content-based and collaborative filtering algorithms or hybrid approaches comprising both them [Adomavicius and Tuzhilin, 2005]. The results delivered by hybrid methods are often ac- ceptable. However, they only depend on content descrip- tions and ratings given to items, and user profile informa- tion. In a time when mobile and location-based services become very popular, context information and semantic knowledge can play a decisive role in order to significantly improve the preciseness of personalized recommendations. If John, for example, is vegetarian, eats only organic food, tries to live economical and goes shopping nearby, it does not make sense to recommend him groceries in stores far away or only in discounters without taking his preference for vegetarian and organic food into consideration. This example shows that context as well as semantic informa- tion (e.g. implicit knowledge about which food products fit to certain eating preferences) are important to satisfyingly answer a user’s grocery recommendation request. 
In order to harness the potential of contextual and se- mantic information, the generic recommender system of Fraunhofer FOKUS, the SMART Recommendations En- gine [Raeck and Steinert, 2010], has been extended by two new recommender extensions. Inspired by the work of [Adomavicius et al., 2005], the SMART Multidimen- sionality Extension enhances the two-dimensional matrix representation of recommender data by a multidimensional recommendation model allowing the integration of addi- tional contextual information when generating recommen- dations. The SMART Ontology Extension, on the other hand, enables the recommender to incorporate contextual and semantic information gained from ontologies (e.g. im- plicit and semantic knowledge about a user and his pref- erences, location, time or ontological classification infor- mation). For this purpose, the extension provides a tool to exploit semantic data stored in ontologies and perform con- text and semantic filtering on the recommender database using several filters. Both extensions can be used indepen- dently from each other or together depending on the given scenario and application. This paper is organized as follows: First, an overview about related work in the field of context-aware and semantic recommender systems is presented. Afterwards, a background about the SMART Recommendations En- gine is given. Following that, concepts for the SMART Multidimensionality Extension are described. Section 5 explains the SMART Ontology Extension including the automated mapping of ontological information into the recommender’s data model. In section 6, the functionality of the SMART Ontology Extension is demonstrated within the scope of a food purchase scenario. 2 Related Work Contextual and semantic information is incorporated in the field of recommendations in various ways. A context- aware collaborative filtering system is presented by [Chen, 2005], which generates item recommendations for a user based on different context situations. In order to achieve that, the traditional collaborative filtering is extended, so that feedback of like-minded users in a similar context can be used to recommend items to the active user in his current context. [Adomavicius et al., 2005] propose a multidimensional recommendation model, in which additional contextual di- 341 mensions can be added to the traditional User x Item matrix representation. Recommendations are calculated by using the reduction-based approach, which reduces the problem of multidimensional recommendations to a traditional two- dimensional matrix in the required context, so that widely known traditional recommendation techniques can be ap- plied after the reduction is done. Another methodology is suggested by [Farsani and Ne- matbakhsh, 2006], which recommends semantic products to customers in the context of eCommerce based on prod- uct and customer classification via OWL. [Kim and Kwon, 2007], on the other hand, developed an ontology model with a multiple-level concept hierarchy for a grocery store scenario with four different ontologies: a product, location, consumer and record ontology. From the product ontology, most relevant products are taken using user information modeled in the consumer and record on- tology. Recommended products are presented in a concept hierarchy ranging from most specific to most broad. When users select some of these concepts, the context of the re- quest is subsequently refined enhancing the specificity of the provided results. 
Another interesting approach is demonstrated by [Yu et al., 2006]. They present a context-aware media recommen- dation platform called CoMeR, which uses an OWL ontol- ogy context description, a N x M-dimensional model and a hybrid processing approach to support media recommen- dations for smart phones. In another paper by [Setten et al., 2004], the integration of a context-aware recommender system into the mobile tourist application COMPASS is described, where users get touristic information and services recommended based on their interests and contexts. COReS, which stands for context-aware, ontology-based recommender system for service recommendation, is an- other example. It was developed by [Costa et al., 2007] and this recommender system extends the capabilities of the IN- FRAWARE [Pereira Filho et al., 2006] service platform by supporting service selection and user needs satisfaction for a certain context. Previous research activities are either focused on the integration of context or semantic information. However, incorporating both – context information and semantic domain knowledge – would increase the preciseness of recommendations decisively. The food scenario shows that the integration of both types of data is necessary to satisfy- ingly answer a recommendation request. For example, in order to generate accurate grocery recommendations, the system has to be aware of the location of the user (context) and has to be capable of using implicit knowledge in order to relate the user’s eating preferences to the right groceries (semantic information). That’s why, Fraunhofer’s engine was extended using both types of data. 3 SMART Recommendations Engine The SMART Recommendations Engine is a generic recom- mender system that enables internet businesses, rich media and entertainment services or SMEs deliver a more person- alized experience by providing recommendations in their respective application domain. By offering a flexible, gen- eral purpose algorithmic model, the engine makes it possi- ble to formulate application specific recommendation al- gorithms. These algorithms and the application specific data model are declared at configuration time by assem- bling selected components. By adding custom components through the provided API, the capabilities of the recom- mender can be extended in order to meet specific applica- tion demands. These custom components can be built from functional groups, such as basic mathematical operations, similarity and relevance computations, sorting and filter- ing, and data access. In the system, data is represented in an entity- relationship-like data model. An entity type that includes a set of entities is named domain, whereas relations between domain entities are represented by matrices (see Figure 1). A user domain, for example, can contain all users, while an item domain can comprise all items in a specific appli- cation. Ratings given by a user to a certain item can be represented by a rating matrix, whose rows and columns are associated to the these domains. Figure 1: Recommender Data Model Recommendation algorithms calculate relevance values for each User x Item pair. Differing on the given applica- tion, the recommendation algorithm is assembled at run- time configuration by defining a network of matrix trans- formation components (e.g. a similarity computation com- ponent). The matrix operations are applied on a data set in a hierarchical manner leading to the estimated utility func- tion at the top node of the tree. 
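As a rough, hypothetical illustration of this idea (not the engine's actual API, whose component and type names are not given in the paper), such a network can be sketched as a tree of matrix-transformation nodes, each consuming the matrices produced by its children; evaluating the root bottom-up yields the estimated utility matrix.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an algorithm hierarchy: each node wraps one matrix
// transformation whose inputs are the output matrices of its child nodes.
interface MatrixTransformation {
    double[][] apply(List<double[][]> inputs);
}

final class AlgorithmNode {
    private final MatrixTransformation op;
    private final List<AlgorithmNode> children;

    AlgorithmNode(MatrixTransformation op, List<AlgorithmNode> children) {
        this.op = op;
        this.children = children;
    }

    // Bottom-up evaluation of the tree; the root returns the utility matrix.
    double[][] evaluate() {
        List<double[][]> inputs = new ArrayList<>();
        for (AlgorithmNode child : children) {
            inputs.add(child.evaluate());
        }
        return op.apply(inputs);
    }
}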
Figure 2 shows such a generic algorithm hierarchy, where the nodes represent ma- trix operations and data is propagated along the edges in form of matrices. Figure 2: Generic Algorithm Hierarchy The engine also provides a custom query language called Sugar Query Language (SuQL), which is used to request recommendations and related data at runtime. Via SuQL various recommendation algorithms can be selected and combined. Constraints on the item set to be recommended can also be specified, so that only items that fulfill these constraints are included in the final recommendation. For this purpose, the recommender already offers a variety of sorting and selection filters, which can be used to alter the result set by certain properties. One example for a filter is the Proximity Filter, which in combination with a Geo-Location typed domain, an Item x Location matrix, a given center position and a maximum range, is capable of selecting items (e.g. shops) in a given geographic region. By using the lookup capabilties of the SMART Ontology Extension (see section 5), items 342 can be filtered based on location constraints, like finding all grocery products sold in shops located within a given distance from the user’s current location. 4 SMART Multidimensionality Extension Contextual information can be exploited in the recom- mendation process by enhancing the level of dimensions from the traditional two-dimensional paradigm to multiple dimensions. Therefore the SMART Recommendations Engine is upgraded by the SMART Multidimensionality Extension, which is able to handle multidimensionality and hence provide more precise recommendations. 4.1 Enhanced Data Model This extension enhances the generalized data model of the SMART Recommendations Engine for multidimensionality purposes by binding more than two domains to a rating ma- trix. Figure 3 shows the enhanced data model in an entity- relationship notation. Figure 3: Enhanced Data Model However, the recommender is designed to work two- dimensional; a matrix can only consist of two domains (one row domain and one column domain). That’s why, the en- hanced data model has to be adapted to the standard data model of the recommender. By merging several domains to one domain and hence mapping multiple dimensions to a two-dimensional matrix, the two-dimensional recom- mender data model can be kept with the advantage that multidimensionality features can be utilized at the same time (see Figure 4). Figure 4: Merged Domains in the Enhanced Data Model The two-dimensional matrix representation of multiple dimensions can be done through serialization by use of a hierarchical index. Each slice of the multidimensional cube is stored into one two-dimensional matrix making it possible to access the desired element with the help of an index (e.g. Slice1.User#1). Figure 5 shows a general MD-2D-Mapping for the enhanced data model. Figure 5: General MD-Mapping to 2D 4.2 Multidimensional Recommendation Algorithms Various standard recommendation algorithms, such as user or item-based collaborative filtering, can be extended by the extension to multidimensional algorithms. The standard user-based collaborative filtering algo- rithm applied by the SMART Recommendations Engine, for example, determines similar users by using ratings given for items by them. Based on this similarity and on the ratings given to items, predictions for new items are cal- culated. 
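A minimal sketch of this standard two-dimensional step is given below. It assumes a dense rating matrix and cosine similarity between user rows; both are illustrative choices, since the paper does not specify the engine's similarity measure and the engine assembles the equivalent steps from its configured transformation components.

// Hypothetical sketch of two-dimensional user-based collaborative filtering:
// user-user similarities are derived from the rating matrix (cosine similarity
// between user rows), and predictions are the matrix product Similarity x Ratings.
final class UserBasedCF {

    static double[][] userSimilarity(double[][] ratings) {
        int users = ratings.length;
        double[][] sim = new double[users][users];
        for (int a = 0; a < users; a++) {
            for (int b = 0; b < users; b++) {
                sim[a][b] = cosine(ratings[a], ratings[b]);
            }
        }
        return sim;
    }

    static double[][] predictions(double[][] sim, double[][] ratings) {
        int users = ratings.length, items = ratings[0].length;
        double[][] pred = new double[users][items];
        for (int u = 0; u < users; u++) {
            for (int i = 0; i < items; i++) {
                for (int v = 0; v < users; v++) {
                    // weighted vote of similar users (matrix product Sim x R)
                    pred[u][i] += sim[u][v] * ratings[v][i];
                }
            }
        }
        return pred;
    }

    private static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return (nx == 0 || ny == 0) ? 0 : dot / Math.sqrt(nx * ny);
    }
}

In the multidimensional variant described next, the user rows are simply replaced by rows of the merged User :: Context domain, while the computation itself stays the same.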
This algorithm can be upgraded to a multidi- mensional user-based collaborative filtering algorithm by merging several domains to one column domain. Assume that there is a rating matrix R[U :: C][I] in the User :: Context x Item space, in which User :: Context displays users with a certain context. R[U :: C][I] = User :: Context× Item (1) The similarity between users in the same context is deter- mined based on their ratings for items by applying a simi- larity transformation on R[U :: C][I]. This calculation also identifies similarities between users in different context sit- uations, which can be useful for later recommendations. Sim(R[U :: C][I], R[U :: C][I]) = User :: Context× User :: Context (2) Having a similarity matrix Sim[U :: C][U :: C] and the rating matrix R[U :: C][I], a matrix product transformation can be applied to calculate item predictions for users in a certain context. PredictionP [U :: C][I] = Sim[U :: C][U :: C] ∗R[U :: C][I] = User :: Context× Item (3) These predictions can be further processed and sorted to generate new recommendations for users in certain con- texts. To better illustrate the new multidimensional user-based collaborative filtering algorithm, an example query “John wants to buy food for a soccer evening” is used. This query includes three dimensions: the user (John), food and the event dimension (here: soccer evening). For that matter, the similarity between users, who bought food for a soccer evening, is calculated using the User :: Event x Food rating matrix twice for the similarity computation (see Figure 6). By applying a matrix product transformation on the calculated similarity matrix User :: Event x User :: Event and the rating matrix, food predictions for John in the context of a soccer evening can be identified. These predictions can further be filtered and sorted based on John’s eating habits, for example, and the rest result can 343 Figure 6: Exemplary Multidimensional User-based Collab- orative Filtering Algorithm – Part 1 be presented as generated recommendations for John (see Figure 7). Figure 7: Exemplary Multidimensional User-based Collab- orative Filtering Algorithm – Part 2 5 SMART Ontology Extension Data that provides additional and useful information to the traditional User x Item representation, such as taxonomies, implicit and indirect knowledge about a user’s preferences or location information can enhance the quality of recom- mendations. Semantic web ontologies offer complex knowledge rep- resentation possibilites with which such semantic informa- tion can be modeled (the Web Ontology Language (OWL) [Dean and Schreiber, 2004]). Implicit knowledge, i.e. knowledge that is not directly included in an ontology, can be inferred using reasoning technologies. The SMART Ontology Extension provides features with which the SMART Recommendations Engine is capable of exploiting semantic information from ontologies including information gained from reasoning mechanisms. In that way, implicit and indirect knowledge as well as taxonomies become available when generating recommendations. The extension is divided into two main parts: The first part is the Ontology Mapping, where all relevant data avail- able in pre-designed ontologies is mapped onto the data model of the recommender. The second part performs se- mantic filterings on the previously extracted data set using the Ontology Filter in order to generate semantic recom- mendations. The functionality of the extension is explained using pre-designed ontologies for the food scenario. 
It comprises users with special eating habits (such as vegetarian, vegan or organic), who want to buy food from grocery stores nearby based on their eating preferences. 5.1 Ontology Mapping The main constructs included in OWL ontologies are in- dividuals, classes, a class hierarchy, object properties, datatype properties and restrictions. These constructs can be mapped onto the data model of the recommender, so that the recommendation engine becomes capable of handling ontology information. Each class hierarchy in OWL is represented by a domain with the name of its respective root class (e.g. Food, User or GroceryStore). Individuals are items of a certain class and are mapped as items of their respective class hierarchy recommender domain. Furthermore, all classes in an ontol- ogy become elements of the general Class domain. While a Food x Class matrix, for example, shows to which classes a food product belongs to, taxonomies become clearly rec- ognizable in a Class x Class matrix. Figure 8 shows an exemplary ontology class structure mapping for the food scenario. Figure 8: Root Class Representation Object properties in OWL form a relation between in- dividuals, whereas datatype properties build a relation be- tween individuals and data types, such as Integer, String or Boolean. Matrices display relationships between domains and are therefore equal to object and datatype properties. The domain and range an OWL property has are both repre- sented as domains in the recommender and their values are elements of these domains. If a property’s domain or range is defined as an ontology class, the recommender domain name corresponds to the class’s name as well. Otherwise, it corresponds to the property’s name. In the example of Figure 9, the property eatingHabit has a domain named User and data range values, such as Ve- gan, Vegetarian or Halal. Therefore the recommender do- mains are User and EatingHabit (name of the OWL prop- erty) defining a matrix. The values in the matrix are also di- rectly mapped from the ontology and represent the assigned property values to individuals (e.g. John eatingHabit Vege- tarian). Based on the Ontology Mapping naming conven- tions, the Ontology Mapping tool automatically creates a recommender-compliant database including domains, 344 Figure 9: Datatype Property Representation matrices and filters. Knowledge gained by reasoning mechanisms is also exported via the tool making it possi- ble to exploit implicit knowledge in the recommendation process. As seen in the screenshot of the tool (Figure 10), the left part of the GUI represents the Ontology Import and Ontology Export functions, whereas the right part comprises all functions concerning the recommender. Figure 10: Ontology Mapping Tool 5.2 Ontology Filter After successfully exporting ontology data into the recom- mender’s database, the Ontology Filter can process the on- tology information in the engine by performing a semantic filtering. This new filter extends the available filter list of the SMART Recommendations Engine and makes it possi- ble to query implicit and indirect ontology knowledge at runtime. Two different matrix operations are provided by the On- tology Filter, the Concept Lookup and the Matrix Lookup. In order to accomplish their task, both lookups use stan- dard set operations, which are also features of the Ontology Filter. 
The Concept Lookup uses the Union and Intersec- tion set operations, while the Existential quantification and Universal quantification are utilized by the Matrix Lookup operation. If needed, the result set delivered by the Ontol- ogy Filter can be inversed by the Not set operation. This can be helpful in order to determine all products that can be eaten by John instead of all that are forbidden for him, for example. Ontology concepts can be looked up in the recommender by using the Concept Lookup, which needs at least two dif- ferent matrices for the operation. Hereby, the column do- main of the first matrix has to be the row domain of the second matrix. Applied on the first matrix, the Concept Lookup filters certain column elements for one single row element based on given constraints. These constraints can be generated by any available SMART Recommendations Engine filter, such as the Existing or Feature filters. The filtered column elements – now selected rows in the row domain of the second matrix – build the basis for further operations. All selected rows will be processed again indi- vidually depending on given filters. The calculated result sets of each selected row will be returned and will then be either unified or intersected based on the selected set oper- ation (Union or Intersection). Figure 11 shows a Concept Lookup example, in which ingredients are looked up that are not allowed to be eaten by John. Figure 11: Concept Lookup Example The Matrix Lookup filters information in a matrix based on a given column domain result set of another matrix. Therefore, it also requires the use of two different matrices, whereas the column domain of the first matrix remains the column domain of the second matrix. Rows of the second matrix will be filtered based on the given column domain result set and a predefined set operation (existential quan- tification or universal quantification). The result is one set of filtered row elements. In Figure 12, an exemplary Ma- trix Lookup can be seen. All food products are looked up that are forbidden for organic eating people. Figure 12: Matrix Lookup Example An overview of all supported operations of the Ontology Filter is given in Table 1. Complex recommendation queries require combining both lookups to single a Concept and Matrix Lookup operation, which is exemplary demonstrated in section 6. 6 Demonstration Powerful contextual/semantic recommendations can be computed by the recommender using the SMART Ontology Extension in combination with the Proximity Filter. The ontology data present in the database of the SMART Recom- mendations Engine facilitates the recommendation query 345 Filter Matrix Ops. Set Ops. Ontology Filter Concept Lookup Union Intersection Matrix Lookup Existentialquantification Universal quantification Not Table 1: Ontology Filter Operations requests of a user by reducing his effort to manually de- scribing his personal needs and interests in a detailed way. In conjunction with the Proximity Filter, the recommenda- tions are also filtered by their location leading to a great user experience when using a mobile shopping service, for example. This function is going to be demonstrated in this section by taking the following mobile grocery shopping service scenario as a basis: Assume that John is vegetarian and eats only organic food products. He wants to buy groceries nearby his lo- cation and that’s why he wants his shopping application to give him recommendations based on the food categories he prefers. 
After the selection of food categories for his shop- ping cart, John gets food products recommended for each category fitting his eating preferences and sorted by their relevance based on John’s previous purchases. The grocery stores that sell these products and are in a certain range to John are also listed. Eating preferences, such as eating vegetarian or organic, include some sort of implicit and indirect knowledge. For example, in order to be able to specify vegetarian prefer- ences, the system must know what kind of food vegetar- ians do not eat. It further has to have the information, which ingredients are included in each grocery, so that non- vegetarian food can be filtered based on the list of non- consumable ingredients. Since semantic ontologies are pre- destined to model this kind of dependencies, three ontolo- gies were designed including a complete data set to map relevant information given in the food scenario. The food-ingredient-shop ontology consists of a classi- fication of food categories, concrete grocery products in- cluding their ingredients and shops, in which they can be found. The eatinghabit-ingredient ontology, on the other hand, models relations between certain eating preferences and the ingredients that are either allowed or forbidden to eat. User profiles are stored in the user-profile ontol- ogy. All data modeled in these ontologies is exported into the recommender database via the Ontology Mapping tool. Figure 13 shows parts of the created data model after a suc- cessful export. Using the SuQL, the SMART Recommendations En- gine can now provide contextual/semantic recommen- dations based on the available data set and the im- plemented recommendation algorithms. For the food scenario, assume that John wants to buy snacks and bread in the range of 2 kms from his location and that he already purchased the brown bread product Naturkind SonnenblumenVollkornbrot Geschnitten. The SuQL query is constructed as follows: At first several se- mantic filterings are performed using the lookup operations several times in order to identify all snack and bread prod- Figure 13: Food Scenario Ontology Data Model ucts that fit John’s eating preferences and his location. Af- terwards, these elements are sorted by their relevance and limited to a certain number depending on the relevance pre- dictions calculated by the recommendation algorithms. Figure 14: Concept and Matrix Lookup Figure 14 shows a semantic filtering example for John, in which the Ontology Filter first performs a Concept Lookup in the User x EatingHabit matrix that looks up his eating preferences. In the second matrix (EatingHabit x Ingre- dient), John’s eating preferences (vegetarian and organic) are mapped to the ingredients. (Note: Due to the fact that organic food products cannot be identified by their ingredients, artificial ingredients, such as Ing Organic or Ing NotOrganic were created and related to the food prod- ucts.) The Matrix Lookup then looks up all groceries in the Food x Ingredient matrix for vegetarians and organic eating people individually. Finally, both result sets are unified to one single result set and inversed by the Not set operation in order to get a set of groceries, which can be eaten by John. These groceries are also filtered by their categories, so that only snacks and bread products remain. Another semantic operation in form of a Matrix Lookup is utilized, so that all grocery stores near to John’s location 346 that sell these food products can be looked up. 
This is done by using the Proximity Filter, where the location of John is specified as well as the range to look for. All grocery stores near to John’s position are delivered to the Matrix Lookup, which then performs a lookup to find all filtered snacks and bread products that are available in these shops. In a final step, the recommendation algorithm is used in the recommendation process. Even though the engine can also generate recommendations based on different collab- orative filtering algorithms, this paper focuses on an ex- tended version of the content-based filtering approach us- ing the ontology taxonomy as content meta-data. This ap- proach – named as the ontology-based filtering algorithm – calculates relevance predictions using the similarity of food products based on their cagetory as content meta-data (e.g. brown bread is more similar to white bread than to snacks) and the implicit user feedback given by users automatically when purchasing items. This user feedback affects the final recommendations in that way that groceries of a category the user has already bought products before become more relevant to him than products of other nodes in the tree. The ontology taxonomy stored in the Food x Class ma- trix is used to compute a similarity between groceries by means of their categories leading to a food similarity ma- trix with the dimensions Food x Food. User x Food rele- vance predictions based on the taxonomy information can be gained by applying a matrix product transformation on the User x Food feedback matrix and the Food x Food sim- ilarity matrix (see Figure 15). Figure 15: Ontology-based Filtering Algorithm Figure 16 shows the SuQL recommender response to John’s query. John bought Naturkind SonnenblumenVollkornbrot Geschnitten before, which is from the bread product category BrownBread. This and all other BrownBread products are most relevant to him. The relevance value decreases the more vegetarian and organic bread products are in other nodes of the Bread tree. There is only one snack fitting his eating preferences with a low relevance value since John did not purchase any snacks yet. As a result, five bread products and one snack are presented in combination with the grocery stores nearby that sell these products. 7 Discussion of the Approach One way to compare the presented approach with widespread memory-based and model-based recommenda- tion algorithms is to compare the accuracy of these ap- proaches based on a common metric such as the mean av- Figure 16: Ontology-based SuQL Recommender Response erage error (MAE). For that, a suitable set of data points for training and testing is needed. The chosen application domain of the presented demonstration scenario was pre- determined by the project’s context in which our approach was developed. This context did not provide us with a suit- able data set for this kind of practical testing. However, a closer examination of our approach shows that a bench- mark recommendation algorithm, which has been adapted to make use of the product hierarchy that is stored in the ontology, provides better relevance values by taking the re- lation of different food items into account. The relevance values of unrelated items decrease in less relevant or un- suitable nodes of the product hierarchy, so that the overall recommendation quality in terms of the achieved accuracy increases. 
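As a compact, hypothetical summary of the ontology-based filtering step sketched in Figure 15: the paper does not give the exact taxonomy similarity measure, so a Jaccard overlap of the items' ancestor-class sets (taken from the Food x Class matrix) is assumed here; the relevance matrix is then the product of the implicit User x Food feedback matrix with this Food x Food similarity matrix, so that items from categories the user has bought from before score higher.

import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the ontology-based filtering algorithm of Figure 15.
final class OntologyBasedFiltering {

    // itemClasses: for each food item index, the set of taxonomy classes it belongs to
    // (an assumed representation of the Food x Class matrix).
    static double[][] taxonomySimilarity(Map<Integer, Set<String>> itemClasses, int items) {
        double[][] sim = new double[items][items];
        for (int a = 0; a < items; a++) {
            for (int b = 0; b < items; b++) {
                Set<String> ca = itemClasses.get(a), cb = itemClasses.get(b);
                long shared = ca.stream().filter(cb::contains).count();
                long union = ca.size() + cb.size() - shared;
                sim[a][b] = union == 0 ? 0.0 : (double) shared / union;  // Jaccard overlap
            }
        }
        return sim;
    }

    // feedback[u][i] = 1 if user u purchased item i before, 0 otherwise.
    static double[][] relevance(double[][] feedback, double[][] sim) {
        int users = feedback.length, items = sim.length;
        double[][] rel = new double[users][items];
        for (int u = 0; u < users; u++) {
            for (int i = 0; i < items; i++) {
                for (int j = 0; j < items; j++) {
                    rel[u][i] += feedback[u][j] * sim[j][i];  // matrix product Feedback x Similarity
                }
            }
        }
        return rel;
    }
}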
8 Conclusion

This paper introduced the SMART Recommendations Engine of Fraunhofer FOKUS, a generic recommender system capable of providing recommendations for different types of applications. In order to make the engine meet the recommendation quality requirements of modern mobile services, the recommender was augmented with two new extensions. The SMART Multidimensionality Extension adds multidimensionality capabilities to the previously two-dimensional recommender system. This upgrade enables the recommender to deal with additional context dimensions and to generate more accurate recommendations that take the user's current contextual situation into account. The SMART Ontology Extension, on the other hand, exploits semantic ontology data by exporting all relevant information given in pre-designed, application-dependent ontologies into the data structure of the SMART Recommendations Engine, enabling the recommender to use semantic knowledge and ontology taxonomy information for recommendation purposes. The demonstration showed that, in conjunction with the Proximity Filter, the SMART Ontology Extension provides added value to the engine by generating semantic and contextual recommendations.
Author Index

Abdulmutalib, Najeeb 169
Ahmadi, Babak 13
Al-Kouz, Akram 19
Albayrak, Sahin 19, 209
Althoff, Klaus-Dieter 269
Azzam, Hany 175
Bach, Kerstin 269
Bauckhage, Christian 97
Baumeister, Joachim 239, 247, 267
Blank, Daniel 183
Bockermann, Christian 25
Boley, Mario 33
Burgos, Daniel 281, 331
Busche, Andre 35
Cheng, Weiwei 39
Clausen, Jan 19
Decker, Björn 275
Dembczynski, Krzysztof 39
Dengel, Andreas 255
Dolog, Peter 315
Doost, Ahmad Salim 289
Eichler, Kathrin 47
Ernst-Gerlach, Andrea 193
Freiberg, Martina 239
Fricke, Peter 51
Friesen, Natalja 59
Fuhr, Norbert 169, 193
Fürnkranz, Johannes 81, 143
Gärtner, Thomas 163
Gey, Fredric 199
Godoy, Daniela 295
Gohr, André 67
Griesbaum, Joachim 221
Haas, Lorenz 267
Habich, Dirk 69
Hadiji, Fabian 13
Hahmann, Martin 69
Hatko, Reinhard 247
Heckmann, Dominikus 303
Hees, Jörn 255
Henrich, Andreas 205
Herder, Eelco 307
Hinneburg, Alexander 67
Horvath, Tamas 33, 75
Hotho, Andreas 151
Hub, Adrian 205
Hüllermeier, Eyke 39
Iqbal, Aftab 77
Janssen, Frederik 81
Jawad, Ahmed 89
Jungermann, Felix 51
Kando, Noriko 199
Karnstedt, Marcel 159
Kawase, Ricardo 307
Kersting, Kristian 13, 97
Klan, Daniel 105
Kluegl, Peter 151
Krohn-Grimberghe, Artus 35, 113
Kurbjuhn, Bastian 139
Land, Sebastian 159
Larson, Ray 199
Lechtenfeld, Marc 191
Lehner, Wolfgang 69
Lemmerich, Florian 267
Loza Mencía, Eneldo 121
Lommatzsch, Andreas 209
Luca, Ernesto William De 19, 209
Mandl, Thomas 221
Mayr, Philipp 213
Melis, Erica 289
Minor, Mirjam 261
Mitlmeier, Johannes 239
Moi, Matthias 323
Morik, Katharina 137
Mutschke, Peter 213
Nanopoulos, Alexandros 113
Neumann, Günter 47
Pan, Rong 315
Papadakis, George 307
Piatkowski, Nico 51
Ploch, Danuta 209
Poigne, Axel 33
Pölitz, Christian 125
Puppe, Frank 239, 247
Raber, Michael 261
Räck, Christian 341
Ramon, Jan 75
Reinhardt, Wolfgang 323
Reutelshoefer, Jochen 247, 267
Roelleke, Thomas 175
Rohe, Thomas 105
Romero, Vicente 331
Roth-Berghofer, Thomas 255
Rueping, Stefan 59, 129
Sauer, Christian Severin 269
Schaer, Philipp 213
Schmidt-Thieme, Lars 35, 113
Schmitt, Ingo 229
Schowe, Benjamin 137
Schreiber, Daniel 337
Schult, Rene 139
Spiliopoulou, Myra 67
Spinczyk, Olaf 51
Stahr, Christoph 163
Stolpe, Marco 51
Sulzmann, Jan-Nikolas 143
Tawileh, Wissam 221
Thiele, Maik 69
Thurau, Christian 97
Toepfer, Martin 151
Traphoener, Ralph 275
Ullrich, Katrin 163
Umbrich, Jürgen 159
Uzun, Abdulbaki 341
Varlemann, Tobias 323
Wahabzada, Mirwaes 97
Wilke, Adrian 323
Wintjes, Jorit 267
Wrobel, Stefan 33
Xu, Guandong 315
Zellhoefer, David 229