topics
Hyperparameters are the settings passed to a machine learning algorithm before training, as opposed to the model parameters that the algorithm learns from the data. Choosing the right hyperparameters for a given application can be crucial for the performance of the learned model. However, it is usually not possible to know the best hyperparameters beforehand, so suitable combinations have to be found by running the algorithm many times with different settings. This can be quite time-consuming, depending on the amount of data, the algorithm, and the number and range of the hyperparameters.
As the different runs of the learning algorithms are independent of each other, it is straightforward to distribute hyperparameter search over several computers.
- Choose several classifiers from Weka.
- Implement a generic hyperparameter search method for the Sun Grid Engine, including a visualization GUI, for those classifiers.
- Deploy it on ISMLL's cluster infrastructure and test it on different application datasets.
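The independence of the runs can be illustrated with a minimal grid search sketch. It is not tied to Weka or the Sun Grid Engine: the `evaluate` function is a hypothetical stand-in for one training run, the hyperparameter names `C` and `gamma` are assumptions chosen for illustration, and a local thread pool takes the place of cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for one training run; returns a validation score."""
    c, gamma = params["C"], params["gamma"]
    # Toy score that peaks at C=10, gamma=0.1 (illustration only).
    return -((c - 10) ** 2) - 100 * (gamma - 0.1) ** 2


def grid_search(grid, evaluate_fn, workers=4):
    """Evaluate every combination in `grid` independently and return the best."""
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[n] for n in names))]
    # Each combination is an independent job, so the jobs can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate_fn, combos))
    best_score, best_params = max(zip(scores, combos), key=lambda pair: pair[0])
    return best_params, best_score, list(zip(combos, scores))


grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
best_params, best_score, results = grid_search(grid, evaluate)
print(best_params, best_score)
```

On a cluster, each entry of `combos` would presumably become one submitted job, and the collected `results` list is the kind of data a visualization GUI could plot.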
available
String and word sequence kernels allow the use of kernel-based methods like support-vector machines (SVMs) directly on text, without any (or at least with less) data preprocessing.
The task is to implement one or several string/word sequence kernels for LIBSVM in C++ or Java and then to compare their performance with the standard approach of polynomial kernels with bag-of-words features. Existing code could be used as a starting point.
- H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini: Text classification using string kernels. JMLR, 2002
- T. Kudo, Y. Matsumoto: Fast methods for kernel-based text analysis. ACL 2003
- N. Cancedda, E. Gaussier, C. Goutte, J. M. Renders: Word sequence kernels. JMLR, 2003
- C. H. Teo, S. V. N. Vishwanathan: Fast and space efficient string kernels using suffix arrays. ICML 2006
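To give a feel for how a kernel can operate directly on text, here is a minimal sketch of a k-spectrum kernel, which counts the character k-grams two strings share. This is a deliberately simpler kernel than the gap-weighted subsequence and word sequence kernels in the papers listed above, and the function names are hypothetical.

```python
from collections import Counter
import math


def kgram_counts(text, k):
    """Count all contiguous substrings of length k in `text`."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-gram count vectors of s and t."""
    cs, ct = kgram_counts(s, k), kgram_counts(t, k)
    return sum(count * ct[gram] for gram, count in cs.items())


def normalized_kernel(s, t, k=3):
    """Scale the kernel so that identical strings get the value 1."""
    denom = math.sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


def gram_matrix(docs, k=3):
    """Pairwise kernel values for a list of documents."""
    return [[normalized_kernel(a, b, k) for b in docs] for a in docs]


docs = ["the cat sat", "the cat ran", "stock prices fell"]
for row in gram_matrix(docs):
    print([round(value, 2) for value in row])
```

A matrix like this could be passed to an SVM implementation that accepts precomputed kernels, which is one way to compare a string kernel against bag-of-words features without changing the solver itself.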
available
Folksonomies are user-generated, flat and lightweight vocabularies that can help to organize massive amounts of data items on websites. Collaborative filtering (CF) is a key technology for recommender systems (RS). It is based on the assumption that users who bought, clicked, or rated similar items will also behave similarly on items they have not yet interacted with. Because of their widespread adoption in domains like online shopping (see amazon.com for an example), and because of the one-million-dollar Netflix Prize, CF and RS have gained publicity in recent years.
The task is to implement several state-of-the-art collaborative filtering algorithms which also take into account folksonomy data, and evaluate them on public datasets.
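The core assumption of collaborative filtering can be sketched with a small user-based neighborhood method. This is a textbook baseline rather than one of the state-of-the-art algorithms the task refers to, it does not yet use folksonomy (tag) data, and the ratings below are made-up toy values.

```python
import math

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 4, "b": 5, "d": 4},
    "cat": {"a": 1, "c": 5, "d": 2},
}


def cosine(u, v):
    """Cosine similarity of two users' rating vectors (missing items count as 0)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


print(predict("ann", "d", ratings))
```

Here the prediction for ann on item d is pulled towards bob's rating, because ann and bob rated items a and b similarly. Tag information could enter such a method, for example, through an additional similarity computed on the tags users assign.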
available
Eamonn Keogh et al. [Keogh 2006] have introduced a new method for shape recognition. It is based on converting the shapes in an image into time series.
The task consists of 3 subtasks:
- review shape recognition methods,
- re-implement the method proposed by Keogh and evaluate it against shape matching implemented in OpenCV,
- adapt the method proposed by Keogh for a special task like recognition of traffic signs.
- E. Keogh, L. Wei, X. Xi, S. H. Lee, M. Vlachos: LB_Keogh supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures. VLDB, 2006
- D. Yankov, E. Keogh, J. Medina, B. Chiu, V. Zordan: Detecting Time Series Motifs Under Uniform Scaling. KDD, 2007
- X. Wang, L. Ye, E. Keogh, Ch. Shelton: Annotating historical archives of images. ACM/IEEE-CS joint conference on Digital libraries, 2008
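One way to picture the conversion of a shape into a time series is the centroid distance profile: walk along the contour and record each point's distance to the shape's centre. The sketch below assumes this representation and compares two profiles by brute force over all circular shifts, so that the choice of starting point (and hence rotation) does not matter. It does not implement the lower-bounding and indexing techniques of the cited paper, and all function names are hypothetical.

```python
import math


def centroid_distance_series(points):
    """Distance of each contour point to the centroid, in contour order."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [math.hypot(x - cx, y - cy) for x, y in points]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def rotation_invariant_distance(a, b):
    """Minimum distance over all circular shifts of b (brute force, O(n^2))."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return min(euclidean(a, b[s:] + b[:s]) for s in range(len(b)))


# A rectangle contour sampled at 8 points, and the same contour traced
# from a different starting point.
contour = [(0, 0), (2, 0), (4, 0), (4, 1), (4, 2), (2, 2), (0, 2), (0, 1)]
shifted = contour[3:] + contour[:3]

series_a = centroid_distance_series(contour)
series_b = centroid_distance_series(shifted)
print(euclidean(series_a, series_b))
print(rotation_invariant_distance(series_a, series_b))
```

The plain Euclidean distance between the two profiles is nonzero although they describe the same shape, while the shift-invariant distance recognizes them as identical. The quadratic cost of trying every shift is what makes faster methods worth studying.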
available
The Data Mining Cup is an annual data analysis competition for undergraduate students. Participants receive a real-life dataset and are asked to make predictions from it.
This year's competition will start on April 15 and last until May 25, so the main workload of this project will be in those 40 days.