TU Delft
Education Type
Education
2013/2014 Electrical Engineering, Mathematics and Computer Science Bachelor Computer Science and Engineering
Responsible Instructor
Name E-mail
Dr.ir. J.C.A. van der Lubbe    J.C.A.vanderLubbe@tudelft.nl
Dr. L.J.P. van der Maaten    L.J.P.vanderMaaten@tudelft.nl
Contact Hours / Week x/x/x/x
0/0/4/0 hc; 0/0/4/0 lab
Education Period
Start Education
Exam Period
Course Language
Course Contents
The goal of the course is to acquaint students with the main techniques for mining big data sets. Specifically, the course will cover algorithms for similar-item retrieval, frequent itemset mining, counting of events, network mining, clustering, classification, collaborative filtering, dimension reduction, and visualization.
Study Goals
After completing the course, the student is able to:
implement standard algorithms from linear algebra and set theory in the MapReduce framework.
implement basic algorithms for the retrieval of similar items.
implement counting algorithms for events in data streams.
explain the PageRank algorithm.
implement basic algorithms for frequent itemset mining.
implement k-means and hierarchical clustering algorithms.
use visualization techniques to obtain insight into data.
explain the workings of basic collaborative filtering algorithms.
explain the basics of social-network graph mining.
understand decision trees and k-nearest neighbor classifiers.
measure the performance of classifiers.
explain the principal components analysis algorithm.
Education Method
The course comprises two lectures and one two-hour lab session per week. The lab assignments are mandatory and may be completed individually or in groups of two. Each assignment must be shown to one of the TAs during the lab sessions; the TAs will ask questions to confirm that the students understand the implemented algorithm. Each lab assignment is a programming assignment of about four hours (two hours in class and two hours at home).

In addition to the programming assignments, the course includes a data mining competition run via Kaggle-in-Class. Students are expected to take part in the competition and to submit a short report on their experiments and results.
Literature and Study Materials
The reading material comes from the book "Mining of Massive Datasets" by Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman. In particular, students are required to read chapters 1, 2, 3, 4, 5, 6, 7, 9, 10, and 11. In addition, students will read the paper "A Tour through the Visualization Zoo" by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky (Communications of the ACM 53(6):59-67, 2010), and a reader on decision trees, k-nearest neighbor classification, and evaluation of classifiers.
Prerequisite courses: TI1200 OO Programmeren, TI1310 Algoritmen en Datastructuren, TI2300 Algoritmiek, WI1200-TI Lineaire Algebra, TI1500 Web- en Databasetechnologie, and the two prior courses in the "variantblok".

Specific topics that are assumed as prior knowledge include:
Discrete mathematics: set intersections, unions, and differences.
Linear algebra: matrix multiplication, linear systems, SVD, and eigendecompositions.
Probability and statistics: multivariate Gaussian distribution and correlation and covariance (matrices).
Programming: C++ or Java (TBD) programming skills and the MapReduce programming model.
Data structures: arrays, linked lists, hash tables, and trees.
Graph theory: bipartite graphs and shortest paths.
Databases: natural, inner, and outer joins.

The competition submissions and the corresponding report are graded and count for 25% of the final grade. The remaining 75% of the grade is determined by a written exam.