
Past master theses at University of Freiburg:
Dominik Benz
Collaborative Ontology Learning

Ontology Learning usually refers to the semi-automatic extraction of semantics from the Web. It builds mainly on techniques from Text Mining and combines machine learning with methods from fields like Information Retrieval and Natural Language Processing, applying them to discover the semantics in the data and to make them explicit.

The preceding discussion implicitly assumes that content exists independently of its usage. However, a large proportion of knowledge is socially constructed; especially common sense knowledge, for example, is derived from and maintained by social interactions. A prominent example are the so-called folksonomies, i.e., the collaborative categorization of items using freely chosen keywords (tags). In contrast to formal controlled vocabularies (also called taxonomies), folksonomies are flat (no hierarchy), unsystematic and unsophisticated; however, for Internet users they dramatically lower content categorization costs because there is no complicated nomenclature to learn.

When the data grows large, organization structures that facilitate navigation through the data are required. Due to the folksonomies' flat structure it is difficult to find broader or narrower tags which may better represent the user's current interests.

The task of this topic is to devise and implement a suitable method for inducing taxonomies among the tags. The method must be evaluated on data samples from realistic tagging systems, comparing the induced hierarchies with gold-standard taxonomies.
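
For illustration only (not necessarily the method to be developed in the thesis), the following sketch shows one common co-occurrence heuristic for inducing broader/narrower relations among tags, in the spirit of subsumption-based taxonomy induction: a tag is considered broader than another if most resources carrying the narrower tag also carry the broader one, but not vice versa. The data structures, tag names and the threshold are illustrative assumptions.

    import java.util.*;

    /** Illustrative co-occurrence based subsumption heuristic for tags:
     *  t is considered broader than s if most resources tagged with s are also
     *  tagged with t, but not the other way around. */
    public class TagSubsumption {

        /** tagToResources maps each tag to the set of resource ids it was assigned to. */
        static List<String[]> subsumptionPairs(Map<String, Set<String>> tagToResources,
                                               double threshold) {
            List<String[]> pairs = new ArrayList<>(); // entries: {broaderTag, narrowerTag}
            for (String broad : tagToResources.keySet()) {
                for (String narrow : tagToResources.keySet()) {
                    if (broad.equals(narrow)) continue;
                    Set<String> rNarrow = tagToResources.get(narrow);
                    Set<String> rBroad = tagToResources.get(broad);
                    long overlap = rNarrow.stream().filter(rBroad::contains).count();
                    double pBroadGivenNarrow = (double) overlap / rNarrow.size();
                    double pNarrowGivenBroad = (double) overlap / rBroad.size();
                    if (pBroadGivenNarrow >= threshold && pNarrowGivenBroad < threshold) {
                        pairs.add(new String[] { broad, narrow });
                    }
                }
            }
            return pairs;
        }

        public static void main(String[] args) {
            // Tiny hypothetical folksonomy sample: "programming" comes out as broader than "java".
            Map<String, Set<String>> tags = new HashMap<>();
            tags.put("programming", new HashSet<>(Arrays.asList("r1", "r2", "r3", "r4")));
            tags.put("java",        new HashSet<>(Arrays.asList("r1", "r2")));
            tags.put("music",       new HashSet<>(Arrays.asList("r5")));
            for (String[] p : subsumptionPairs(tags, 0.8))
                System.out.println(p[0] + " is broader than " + p[1]);
        }
    }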

Stefan Siegler
Learning Bayesian Network Structures

Learning the structure of Bayesian Networks from data is one of the core tasks in Bayesian Networks and has recently been addressed by several algorithms. Starting from the CGNM/BN implementation in Java, which supports basic tasks such as I/O, inference and parameter learning, first some simple and fast algorithms like K2 should be implemented. Building on that, two more complex algorithms (PC and GES) should be implemented efficiently. The performance of these algorithms should be evaluated in terms of the quality of the solution found as well as runtime on real-life datasets. Finally, synthetic datasets created from synthetic BNs should be used to assess the algorithms under controllable conditions.
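
As a rough illustration of the simplest of the algorithms mentioned above, the following sketch shows the core of K2: given a fixed node ordering, parents of each node are added greedily from its predecessors as long as the Cooper-Herskovits score improves. The data layout and the parent limit are assumptions; the thesis itself would build on the existing CGNM/BN implementation rather than on this standalone sketch.

    import java.util.*;

    /** Sketch of greedy K2 structure search over discrete data.
     *  data[c][v] holds the value of variable v in case c, coded as 0..arity[v]-1. */
    public class K2Sketch {

        static double logFactorial(int n) {
            double s = 0.0;
            for (int i = 2; i <= n; i++) s += Math.log(i);
            return s;
        }

        /** Log Cooper-Herskovits score of node i given a candidate parent set. */
        static double chScore(int[][] data, int[] arity, int i, List<Integer> parents) {
            int r = arity[i];
            Map<List<Integer>, int[]> counts = new HashMap<>();   // parent configuration -> N_ijk
            for (int[] row : data) {
                List<Integer> cfg = new ArrayList<>();
                for (int p : parents) cfg.add(row[p]);
                counts.computeIfAbsent(cfg, k -> new int[r])[row[i]]++;
            }
            double score = 0.0;
            for (int[] nijk : counts.values()) {
                int nij = Arrays.stream(nijk).sum();
                score += logFactorial(r - 1) - logFactorial(nij + r - 1);
                for (int c : nijk) score += logFactorial(c);
            }
            return score;
        }

        /** K2: for each node, greedily add the best-scoring predecessor (w.r.t. the given ordering)
         *  as a parent until no candidate improves the score or maxParents is reached. */
        static Map<Integer, List<Integer>> k2(int[][] data, int[] arity, int[] order, int maxParents) {
            Map<Integer, List<Integer>> parentSets = new HashMap<>();
            for (int pos = 0; pos < order.length; pos++) {
                int node = order[pos];
                List<Integer> parents = new ArrayList<>();
                double best = chScore(data, arity, node, parents);
                boolean improved = true;
                while (improved && parents.size() < maxParents) {
                    improved = false;
                    int bestCand = -1;
                    for (int q = 0; q < pos; q++) {
                        int cand = order[q];
                        if (parents.contains(cand)) continue;
                        parents.add(cand);
                        double s = chScore(data, arity, node, parents);
                        parents.remove(parents.size() - 1);
                        if (s > best) { best = s; bestCand = cand; improved = true; }
                    }
                    if (improved) parents.add(bestCand);
                }
                parentSets.put(node, parents);
            }
            return parentSets;
        }
    }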

Manuel Stritt
Mixture Models for Wafer Failure Analysis

During the production of semiconductor chips, silicon wafers consisting of 100-1000 parts are processed in many different steps. Only after the chips have been finished completely can their functionality be tested. To assure a high quality of the final product, up to several hundred tests are conducted per chip. As chips are discarded already due to a single failing test, a test sequence is stopped at the first failure encountered. The vector of all test results is called failure vector or fingerprint.

In real production environments, detractors can cause failures. Usually not just a single cause is responsible for such failures, but typical mixtures of different causes.

The task of this diploma thesis topic is to adapt mixture models based on Bayesian Networks to the problem, implement a suitable learning algorithm and run empirical experiments. Different types of data should be analyzed:

  1. synthetic data generated by a domain model,
  2. simplified real data, and
  3. complex real data.
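
To make the idea of a mixture model over failure fingerprints concrete, here is a minimal sketch using a plain mixture of independent Bernoullis fitted with EM; the Bayesian-Network-based mixtures the thesis is about would replace the within-cluster independence assumption by richer models. All names and the data layout are assumptions.

    import java.util.*;

    /** Sketch: EM for a mixture of independent Bernoullis over binary failure fingerprints.
     *  x[n][t] is 1 if test t failed for chip n, 0 otherwise. */
    public class FingerprintMixture {

        static void fit(int[][] x, int k, int iterations, long seed) {
            int n = x.length, d = x[0].length;
            double[] pi = new double[k];          // mixture weights, one per failure-cause cluster
            double[][] mu = new double[k][d];     // per-cluster failure probability per test
            Random rnd = new Random(seed);
            Arrays.fill(pi, 1.0 / k);
            for (int c = 0; c < k; c++)
                for (int t = 0; t < d; t++) mu[c][t] = 0.25 + 0.5 * rnd.nextDouble();

            double[][] resp = new double[n][k];
            double[] logp = new double[k];
            for (int it = 0; it < iterations; it++) {
                // E-step: responsibilities P(cluster c | fingerprint i), computed in log space
                // because fingerprints may contain several hundred tests.
                for (int i = 0; i < n; i++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int c = 0; c < k; c++) {
                        double lp = Math.log(pi[c]);
                        for (int t = 0; t < d; t++)
                            lp += Math.log(x[i][t] == 1 ? mu[c][t] : 1.0 - mu[c][t]);
                        logp[c] = lp;
                        if (lp > max) max = lp;
                    }
                    double norm = 0.0;
                    for (int c = 0; c < k; c++) { resp[i][c] = Math.exp(logp[c] - max); norm += resp[i][c]; }
                    for (int c = 0; c < k; c++) resp[i][c] /= norm;
                }
                // M-step: re-estimate cluster weights and failure probabilities (with smoothing).
                for (int c = 0; c < k; c++) {
                    double nc = 0.0;
                    double[] sum = new double[d];
                    for (int i = 0; i < n; i++) {
                        nc += resp[i][c];
                        for (int t = 0; t < d; t++) sum[t] += resp[i][c] * x[i][t];
                    }
                    pi[c] = (nc + 1.0) / (n + k);
                    for (int t = 0; t < d; t++) mu[c][t] = (sum[t] + 1.0) / (nc + 2.0);
                }
            }
            System.out.println("estimated cluster weights: " + Arrays.toString(pi));
        }
    }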

This diploma thesis topic is offered jointly with Infineon Technologies, Regensburg.

Zhiwei Wei
Integrating Bayesian Networks and Collaborative Filtering for Recommender Systems

One of the most successful recent models for recommender systems is a simple Bayesian Network consisting of only four nodes for user ID, item ID, rating and a hidden class node, the so-called "aspect model" by Hofmann 2004. While the aspect model gives high-quality recommendations, the learning process based on the expectation-maximization (EM) algorithm is slow. Traditionally, simple nearest neighbor models called collaborative filtering have been used for this task. Compared to the aspect model, these models are fast. To improve prediction accuracy, probabilistic ideas have been integrated into collaborative filtering techniques.

The task of this topic is to work the other way around and use initialization schemes based on collaborative filtering techniques to accelerate the Bayesian Network learning. Furthermore, the results should be compared with plain collaborative filtering, with the aspect model, as well as with probabilistic collaborative filtering.
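
For orientation, the following sketch shows EM for the co-occurrence variant of the aspect model on implicit feedback; Hofmann's 2004 rating model adds a rating node but is trained analogously. The random initialization marked in the code is exactly what the thesis would replace by a collaborative-filtering-based scheme; all names and the data layout are assumptions.

    import java.util.*;

    /** Sketch: EM for a co-occurrence "aspect model" P(u,i) = sum_z P(z) P(u|z) P(i|z),
     *  here over implicit feedback given as observed (user, item) pairs. */
    public class AspectModelSketch {

        static void fit(int[][] pairs, int numUsers, int numItems, int k, int iterations, long seed) {
            Random rnd = new Random(seed);
            double[] pz = new double[k];               // P(z)
            double[][] puz = new double[k][numUsers];  // P(u|z)
            double[][] piz = new double[k][numItems];  // P(i|z)
            // Random initialization; the thesis would replace this by an initialization
            // derived from collaborative filtering, e.g., seeding P(u|z) from a
            // nearest-neighbor-based clustering of the users (an assumption here).
            for (int z = 0; z < k; z++) {
                pz[z] = 1.0 / k;
                for (int u = 0; u < numUsers; u++) puz[z][u] = 1.0 + rnd.nextDouble();
                for (int i = 0; i < numItems; i++) piz[z][i] = 1.0 + rnd.nextDouble();
                normalize(puz[z]);
                normalize(piz[z]);
            }
            double[] post = new double[k];
            for (int it = 0; it < iterations; it++) {
                double[] nz = new double[k];
                double[][] nuz = new double[k][numUsers];
                double[][] niz = new double[k][numItems];
                for (int[] p : pairs) {
                    int u = p[0], i = p[1];
                    // E-step for this observation: posterior P(z | u, i).
                    double norm = 0.0;
                    for (int z = 0; z < k; z++) { post[z] = pz[z] * puz[z][u] * piz[z][i]; norm += post[z]; }
                    // Accumulate expected counts for the M-step.
                    for (int z = 0; z < k; z++) {
                        double r = post[z] / norm;
                        nz[z] += r; nuz[z][u] += r; niz[z][i] += r;
                    }
                }
                // M-step: re-normalize the expected counts.
                for (int z = 0; z < k; z++) {
                    pz[z] = nz[z] / pairs.length;
                    puz[z] = nuz[z]; normalize(puz[z]);
                    piz[z] = niz[z]; normalize(piz[z]);
                }
            }
        }

        static void normalize(double[] v) {
            double s = 0.0;
            for (double x : v) s += x;
            for (int j = 0; j < v.length; j++) v[j] /= s;
        }
    }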

Oliver Olesen
An editor generator for instances of XML Schema

Writing a schema-valid XML document makes several demands on the author. He has to know about well-formed XML on the one hand as well as about the XML Schema standard on the other. Last but not least, he has to be aware of the specific schema that is to be fulfilled.

There already are schema-aware text editors helping the author with most of the listed issues. In this thesis, however, an editor generator shall be designed and implemented. The generator ought to produce a specific editor per schema.

The generated editor ought to provide a form-based way to edit schema-valid XML documents. In addition, the editor is intended to be capable of validating constraints that exceed the expressive capability of XML Schema.

This topic is offered in cooperation with Vector Consulting GmbH, Stuttgart.

Magnus Herold
Collaborative Personal Ontology Evolution

Ontologies specify the knowledge about a domain of interest using formal semantics, for example the products offered by an online shop together with a taxonomy of product categories organizing the products for better browsing and searching, or an information portal like a digital library organizing books in a hierarchical category system.

Personalized semantic applications allow users to have a personal copy of the ontology and to tailor it to their needs: e.g., they can choose to see only a subset of all the categories available, merge existing categories, or introduce completely new categories. In large information portals one can try to support users in maintaining their personal ontology by recommending changes to them based on the ontologies of other users. For example, if a user has many articles about "Java" and "C++" in his personal bibliography, but keeps all of them in a common category "programming languages", while other users have a better organization with two subcategories for "Java" and "C++", respectively, one would like to recommend that the user add these two subcategories and assign the papers accordingly.

Methods from Machine Learning, especially from Recommender Systems and Collaborative Filtering, can be applied to learn such recommendations.

The task of this topic is to implement different strategies for recommending categories, their super- and subcategories, and the assignment of products to the categories: very simple strategies that take into account other users' ontologies only in a summary way, e.g., always recommend the most often used concept first, as well as personalized recommenders based on collaborative filtering methods. The strategies should be evaluated on several synthetic datasets.
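
As an example of the "very simple strategies" mentioned above, here is a minimal sketch of a non-personalized baseline that always recommends the categories most often used by other users and not yet present in the target user's own ontology; the data structures and names are illustrative assumptions.

    import java.util.*;

    /** Sketch: non-personalized baseline recommender that suggests the categories
     *  used most often by other users and not yet present in the target user's ontology. */
    public class MostFrequentCategory {

        static List<String> recommend(Map<String, Set<String>> userCategories,
                                      String targetUser, int topN) {
            Map<String, Integer> freq = new HashMap<>();
            for (Map.Entry<String, Set<String>> e : userCategories.entrySet()) {
                if (e.getKey().equals(targetUser)) continue;
                for (String cat : e.getValue()) freq.merge(cat, 1, Integer::sum);
            }
            Set<String> own = userCategories.getOrDefault(targetUser, Collections.emptySet());
            return freq.entrySet().stream()
                    .filter(e -> !own.contains(e.getKey()))
                    .sorted((a, b) -> b.getValue() - a.getValue())
                    .limit(topN)
                    .map(Map.Entry::getKey)
                    .collect(java.util.stream.Collectors.toList());
        }
    }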

Steffen Rendle
Product Identification and Clustering

Automatically structuring offers is a key task in e-commerce. Especially if offers are collected from different shops, two problems arise:

  1. identifying products in different offers (e.g., the same product offered in different shops), and
  2. grouping similar products into categories based on some metadata about the products, which possibly is extracted automatically from HTML pages and thus might be dirty.

Although hand-crafted identifications and groupings usually are of very good quality, this approach is too expensive and, for product identification, too slow.

The task of this topic is to use information extraction methods for extracting suitable features from the textual product metadata and to cluster the products using a heuristic similarity measure combining string distances (Levenshtein / edit distance) and term weighting schemes (tf-idf). Models for two or three different domains should be built with these methods on real-life datasets. As these heuristics are expected to be domain-specific, in a second step the similarity measure should be adapted automatically based on explicit user feedback, which possibly is collected via active learning.
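
To make the heuristic similarity measure concrete, the following sketch combines a normalized Levenshtein distance on the full product names with a tf-idf-weighted cosine similarity over their tokens; the combination weight alpha and all names are assumptions that would have to be tuned per domain or, as described above, learned from user feedback.

    import java.util.*;

    /** Sketch: heuristic product similarity combining edit distance and tf-idf cosine. */
    public class ProductSimilarity {

        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                }
            return d[a.length()][b.length()];
        }

        /** tf-idf weighted token vector; df holds document frequencies over the product corpus. */
        static Map<String, Double> tfidf(List<String> tokens, Map<String, Integer> df, int numDocs) {
            Map<String, Double> tf = new HashMap<>();
            for (String t : tokens) tf.merge(t, 1.0, Double::sum);
            for (Map.Entry<String, Double> e : tf.entrySet())
                e.setValue(e.getValue() * Math.log((double) numDocs / df.getOrDefault(e.getKey(), 1)));
            return tf;
        }

        static double tfidfCosine(List<String> a, List<String> b, Map<String, Integer> df, int numDocs) {
            Map<String, Double> va = tfidf(a, df, numDocs), vb = tfidf(b, df, numDocs);
            double dot = 0.0, na = 0.0, nb = 0.0;
            for (Map.Entry<String, Double> e : va.entrySet()) {
                na += e.getValue() * e.getValue();
                dot += e.getValue() * vb.getOrDefault(e.getKey(), 0.0);
            }
            for (double w : vb.values()) nb += w * w;
            return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
        }

        /** Combined heuristic: weighted mix of name similarity and token similarity
         *  (alpha is a tuning assumption, later to be learned from user feedback). */
        static double similarity(String nameA, String nameB, Map<String, Integer> df, int numDocs, double alpha) {
            double nameSim = 1.0 - (double) levenshtein(nameA, nameB)
                    / Math.max(1, Math.max(nameA.length(), nameB.length()));
            double tokenSim = tfidfCosine(Arrays.asList(nameA.toLowerCase().split("\\s+")),
                                          Arrays.asList(nameB.toLowerCase().split("\\s+")), df, numDocs);
            return alpha * nameSim + (1.0 - alpha) * tokenSim;
        }
    }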

This topic is offered jointly with Mentasys GmbH, Karlsruhe, who provide data and domain expertise.

Christine Preisach
Ensembles of relational and text-based models with applications to the classification of scientific publications

In traditional text classification, only attributes of the document itself (like words included in the title or abstract) have been considered. But many documents are related to other documents through their metadata, for instance by references, shared authors, conferences or journals.

The goal of this topic is to use these relationships for classification and to analyse whether doing so improves classification accuracy. To this end, several methods from the area of probabilistic relational learning should be applied to three bibliographic datasets. Furthermore, it should be investigated whether a combination of relational classification and traditional text classification can improve classification accuracy further.

Ashraf Yassin
A collaborative, bibliographic Wiki

Collaboration over the internet is constantly becoming more important in research and industry. Many different application patterns like Wikis have emerged to support the ad-hoc style often encountered in such collaborations. While Wikis are perfectly suited for free-style, loosely linked texts, the management of more structured information such as bibliographic collections is not well supported (but see, e.g., wikindx for an existing bibliographic Wiki).

Starting from a review of existing collaborative, web-based bibliographic tools (see e.g., Resource list of OpenOffice for such tools) and their main features, a design for such a tool based on a wiki platform should be developed and implemented on top of XWiki (or any other suitable, Java-based Wiki-platform). A special focus of this implementation should be (1) the management of access rights to individual records or sensitive annotations, (2) the ability to annotate records with rating information, and (3) the ability to keep a personal bibliography per user organized in a user-defined hierarchy as well as branches shared with other users.

Robert Koppa
Learning models for ACM classification

Computer science literature, i.e., books and articles, is classified (manually) according to the ACM classification (e.g., "I.2" is artificial intelligence; ACM = Association for Computing Machinery, one of the big international computer science societies).

The goal of this topic is to learn a model, e.g., a Bayesian network, that tries to classify a paper by its metadata, i.e., title, authors, journal or conference, year, etc. This involves methods from text mining / information retrieval (e.g., variables may be constructed from title keywords), as well as some advanced data mining methods (e.g., dimensionality reduction, as typically a huge number of variables is involved). A database with 200,000 hand-labeled training examples is available.
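
Just to illustrate a possible starting point, the following sketch shows a multinomial naive Bayes classifier over title keywords, i.e., the simplest special case of such a Bayesian network; further metadata variables and dimensionality reduction are left out, and all names are illustrative assumptions.

    import java.util.*;

    /** Sketch: multinomial naive Bayes over title keywords for ACM categories.
     *  Other metadata fields (authors, venue, year) could be added as further variables. */
    public class TitleNaiveBayes {

        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // class -> word -> count
        Set<String> vocabulary = new HashSet<>();
        int numDocs = 0;

        void train(String acmClass, String title) {
            numDocs++;
            classCounts.merge(acmClass, 1, Integer::sum);
            Map<String, Integer> wc = wordCounts.computeIfAbsent(acmClass, k -> new HashMap<>());
            for (String w : title.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                wc.merge(w, 1, Integer::sum);
                vocabulary.add(w);
            }
        }

        String classify(String title) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String c : classCounts.keySet()) {
                Map<String, Integer> wc = wordCounts.get(c);
                int total = wc.values().stream().mapToInt(Integer::intValue).sum();
                double score = Math.log((double) classCounts.get(c) / numDocs);   // class prior
                for (String w : title.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    // Laplace-smoothed word likelihood P(w | class).
                    score += Math.log((wc.getOrDefault(w, 0) + 1.0) / (total + vocabulary.size()));
                }
                if (score > bestScore) { bestScore = score; best = c; }
            }
            return best;
        }
    }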

Okan Basegmez
Attribute-aware Volatile Recommender Systems

Many recommender systems view products as "atomic entities" without any attributes. On the other hand, in most application scenarios (including all e-commerce scenarios) attributes of products are well known. Not using these attributes for the computation of recommendations seems to be a waste of most valuable information. Volatile recommender systems do not identify users, but are task-driven (see, e.g., Karstadt).

The task of this topic is to design and implement a framework for the evaluation of volatile recommender systems for products with attributes. By means of an interface to data mining software, different modelling setups, learning algorithms and models should be compared on a real-life dataset.

Rui Xi
Attribute-aware Personalized Recommender Systems

Many recommender systems view products as "atomic entities" without any attributes. On the other hand, in most application scenarios (including all e-commerce scenarios) attributes of products are well known. Not using these attributes for the computation of recommendations seems to be a waste of most valuable information. Personalized recommender systems identify users and thus should learn user-individual preferences (see, e.g., Amazon; you will have to register to use the system).

The task of this topic is to enrich a real-life dataset with product features by wrapping information from an information portal, and to design and implement a framework for the evaluation of personalized recommender systems for products with attributes. By means of an interface to data mining software, different modelling setups, learning algorithms and models should be compared.

V. Ernesto Diaz Aviles
Semantic Peer-to-Peer Recommender Systems

As recommender systems aim at making the experiences of other users available, they typically make use of a centralized information pool, e.g., to compute neighborhoods in collaborative filtering. In a peer-to-peer scenario such a central knowledge repository is not available, only information from peers. Caching strategies or dynamic peer selection have to be used to compute useful recommendations from local information.

The goal of this topic is to implement a simulation framework for semantic peer-to-peer recommender systems using a simple nearest-neighbor-based recommendation algorithm running locally on a peer (i.e., having access only to its peers). The domain (e.g., books, music) should be modelled by a domain ontology. Experiments should be run to assess different strategies for caching and peer selection.
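
For illustration, the following sketch shows a plain user-based nearest-neighbor recommender that uses only the profiles locally available on a peer (its own profile plus cached peer profiles); the caching and peer-selection strategies to be studied determine which profiles end up in this local set, and the semantic (ontology-based) similarity is omitted here. All names are assumptions.

    import java.util.*;

    /** Sketch: user-based kNN recommendation from locally available peer profiles only.
     *  A profile is the set of item ids a user has liked or consumed. */
    public class LocalKnnRecommender {

        static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) return 0.0;
            long inter = a.stream().filter(b::contains).count();
            return (double) inter / (a.size() + b.size() - inter);
        }

        /** Recommend items from the k most similar cached peer profiles
         *  that the local user does not have yet. */
        static List<String> recommend(Set<String> ownProfile, Collection<Set<String>> cachedPeerProfiles,
                                      int k, int topN) {
            List<Set<String>> neighbors = new ArrayList<>(cachedPeerProfiles);
            neighbors.sort((p, q) -> Double.compare(jaccard(ownProfile, q), jaccard(ownProfile, p)));
            Map<String, Double> scores = new HashMap<>();
            for (Set<String> peer : neighbors.subList(0, Math.min(k, neighbors.size()))) {
                double sim = jaccard(ownProfile, peer);
                for (String item : peer)
                    if (!ownProfile.contains(item)) scores.merge(item, sim, Double::sum);
            }
            return scores.entrySet().stream()
                    .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                    .limit(topN)
                    .map(Map.Entry::getKey)
                    .collect(java.util.stream.Collectors.toList());
        }
    }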

Thomas Franz
Design and Prototypical Implementation of a Platform for Personalized Recommender Systems

A platform for personalized recommender systems has to provide access to an online information system, systematically track users' actions and derive preference indicators, e.g., which products a user looks at, how long he stays with a product, which products he buys, etc. The platform should be able to use arbitrary recommender system models via an interface. Furthermore, it should address data management, e.g., allow users to edit and correct preference indicators that have been extracted automatically.

There are several tasks to solve for this topic: First, an analysis of existing platforms for personalized recommender systems found on the internet (at Amazon and many other shops) has to be conducted. Second, requirements for a specific application scenario have to be fixed. Third, a generic database structure for such a system has to be developed. Fourth, a generic prototypical system has to be implemented and set up in a specific application context. Optionally, the thesis may contain some first observations on how users use the platform (a preliminary descriptive usage analysis).

Aigulia Kutmanova
A Generic Data Warehouse Model for Recommender Systems

This topic consists of a theoretical and a practical part.

In the theoretical part, the state of the art of modelling multidimensional schemas for data warehouses should be researched. The focus here is on conceptual models; implementation models and schema mappings (such as star and snowflake schemas) should not be covered. The results should contain

  1. a short description and a structuring of the different approaches proposed,
  2. a description of the main differences between the approaches regarding schema constructs, representation of constructs, general expressiveness, and handling of typical modelling problems in data warehouses, and
  3. an in-depth description of two of the most promising approaches.

In the practical part, a case study should be conducted, building a generic data warehouse model for recommender systems. The main focus here is on the specification of a suitable and flexible model in one of the modelling languages handled in depth in the first part. At a minimum, core data of anonymous and volatile recommender systems such as task profiles, product data, recommendation lists, preference indicators, and the corresponding micro-conversion rates should be modelled.

As an indicator of excellence, the data warehouse model could be implemented in a prototype that proves the feasibility of the approach. Real-life data for such an experiment is available.

Past master theses at University of Hildesheim