Data Processing Techniques

Learning Outcomes

The objective of this course is to familiarize students with: (a) methods for accessing large volumes of data in various formats, as well as for writing data in a scalable way, (b) efficient data storage and retrieval using appropriate indexing techniques, (c) the design and implementation of data processing algorithms for building efficient data management applications.

Upon successful completion of the course, students will be able:

  • to develop data-centric applications with emphasis on efficiency and scalability
  • to use the most appropriate indexing methods for a given problem
  • to evaluate and improve the parts of data processing algorithms that incur high computational load
  • to apply the data processing techniques best suited to the data under analysis and to a given query workload
  • to develop efficient data processing algorithms

Course Contents

  • Operation of disk and main memory, serial and random access, cost and efficiency, data locality on disk and in main memory, direct and indirect access, main-memory data structures (arrays, priority queues, hashing).
  • Data access techniques for structured, semi-structured and unstructured data: relational DBs, XML, RDF, text documents, web pages, web APIs, social networks.
  • One-dimensional data and indexing, B-tree and its variations (B+-tree, B*-tree), range queries, inverted indexes.
  • Spatial data, spatial data types, spatial queries, approximation in representation, distance measures, extensions for multidimensional data.
  • Spatial indexing techniques, grid file, spatial indexes (R-tree, quadtree), space-filling curves (Hilbert, Z-order).
  • Similarity search, k-nearest neighbor search, branch-and-bound algorithms, locality-sensitive hashing (LSH), approximate k-nearest neighbor search.
  • Top-k search: algorithms based on pre-processing, online algorithms, Fagin’s algorithm, index-based algorithms.
  • Join queries, spatial joins, top-k joins.
  • Spatio-textual data, query types, indexing methods, processing algorithms.

Bibliography

  • Ramakrishnan R. & Gehrke J. (2002): Database Management Systems (3rd Edition), McGraw Hill.
  • Mamoulis N. (2011): Spatial Data Management, Synthesis Lectures on Data Management, Morgan & Claypool.