Learning Outcomes
The course teaches data management methods and programming techniques suitable for big data. Its goal is to familiarize students with solving problems that require the processing and storage of different types of data (structured, semi-structured and unstructured), the efficient programming of parallel processing solutions, and the design and implementation of scalable algorithms that operate on distributed data collections. Modern big data processing technologies and systems are also covered.
Upon successful completion of the course, students will be able:
- to understand the basic concepts related to big data
- to understand the problems associated with managing large collections of diverse data
- to solve practical problems of modeling, storing, processing and analyzing big data
- to use tools and programming techniques for big data
- to understand the principles of parallel processing (data sharing, load balancing, scalability, fault tolerance)
- to design algorithms suitable for parallel processing of big data
- to implement algorithms for processing big data in an efficient and scalable manner
General Competences
- Search for, analysis and synthesis of data and information using appropriate technologies,
- Adapting to new situations,
- Decision-making,
- Individual/Independent work,
- Critical thinking,
- Introduction of innovative research,
- Development of free, creative and inductive thinking.
Course Contents
- Modeling of big data, definitions, the 6 Vs (Volume, Variety, Velocity, Veracity, Validity and Volatility), modeling techniques related to big data, requirements for big data management platforms, the process of big data analysis, challenges related to big data.
- Data storage in cloud computing infrastructures, columnar file storage formats, data compression, selective access to subsets of records.
- Distributed File Systems (DFS), the Hadoop distributed file system, the concept of HDFS blocks, increased availability through replication, characteristic advantages of HDFS.
- Principles of distributed and parallel data management, local and global indexes, partitioning techniques (round-robin, hash-based, range-based), distributed query processing, distributed query optimization, load balancing.
- The MapReduce programming model, map and reduce phases, designing MapReduce jobs, simple and more complex jobs, aggregations, efficient parallel processing, optimization.
- The Hadoop ecosystem for batch processing, implementing MapReduce jobs in Hadoop, programming with Java, programming with Python, creating custom data types.
- Managing data in distributed main memory, the concept of Resilient Distributed Datasets, immutability, Apache Spark.
- Introduction to Spark DataFrames, learning the DataFrame API, declarative processing of big data, the Spark SQL language, execution plans, optimization.
- Real-time processing, basic data flow concepts, data stream management systems, stateful and stateless processing, windowing mechanisms, data stream processing.
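As an illustration of the partitioning techniques named in the contents above (round-robin, hash-based, range-based), the following is a minimal single-machine sketch in Python. The function names, the sample records and the range boundaries are hypothetical and chosen only for illustration; they are not taken from the course material, and a real distributed system would place each partition on a different node.

```python
from bisect import bisect_right
from zlib import crc32


def round_robin_partition(records, num_partitions):
    """Assign record i to partition i mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def hash_partition(records, num_partitions, key):
    """Assign each record according to a deterministic hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # crc32 is used instead of the built-in hash() so that the
        # assignment is stable across interpreter runs.
        h = crc32(str(key(record)).encode("utf-8"))
        partitions[h % num_partitions].append(record)
    return partitions


def range_partition(records, boundaries, key):
    """Assign each record according to the key range it falls into.

    `boundaries` must be sorted. Partition 0 holds keys below
    boundaries[0], partition j holds keys k with
    boundaries[j-1] <= k < boundaries[j], and the last partition
    holds keys >= boundaries[-1].
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect_right(boundaries, key(record))].append(record)
    return partitions


# Hypothetical sample data: (key, value) pairs.
records = [(3, "a"), (17, "b"), (25, "c"), (8, "d"), (42, "e")]
first_field = lambda r: r[0]

rr_parts = round_robin_partition(records, 2)
hash_parts = hash_partition(records, 3, key=first_field)
range_parts = range_partition(records, [10, 30], key=first_field)
```

Round-robin spreads records evenly but ignores their keys, hash-based partitioning sends equal keys to the same partition, and range-based partitioning keeps neighbouring keys together, which helps range queries but can lead to load imbalance when the key distribution is skewed.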
Suggested Bibliography
- Apostolos Papadopoulos, Alexandros Karakasidis, Georgia Koloniari, Anastasios Gounaris (2024): Τεχνολογίες επεξεργασίας και ανάλυσης μεγάλων δεδομένων [Big Data Processing and Analysis Technologies, in Greek]. Kallipos: https://repository.kallipos.gr/handle/11419/14277
- Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (2020): Mining of Massive Datasets, 3rd Edition, Cambridge University Press.
- Raghu Ramakrishnan, Johannes Gehrke (2012): Database Management Systems, 3rd Edition, McGraw-Hill. ISBN: 978-0072465631.
- M. Tamer Özsu, Patrick Valduriez (2011): Principles of Distributed Database Systems, 3rd Edition, Springer. ISBN: 978-1-4419-8833-1.
- Tom White (2015): Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th Edition. O’Reilly Media. ISBN: 9781491901632.
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (2015): Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media. ISBN: 9781449358624.
- Martin Kleppmann (2017): Designing Data-Intensive Applications. O’Reilly Media. ISBN: 9781491903100.

