Big Data Programming | |
---|---|
Professors | |
Course category | Scientific expertise |
Course ID | DS-537 |
Credits | 5 |
Lecture hours | 3 hours |
Lab hours | 2 hours |
Digital resources | View on Aristarchus (Open e-Class) |
The course teaches data management methods and programming techniques suitable for big data. Its goal is to familiarize students with solving problems that require the processing and storage of different types of data (structured, semi-structured and unstructured), with the efficient programming of parallel processing solutions, and with the design and implementation of scalable algorithms that operate on distributed data collections. Modern big data processing technologies and systems are covered.
Upon successful completion of the course, the students will be in a position:
- to understand the basic concepts related to big data
- to understand the problems associated with managing large collections of diverse data
- to solve practical problems of modeling, storing, processing and analyzing big data
- to use tools and programming techniques for big data
- to understand the principles of parallel processing (data sharing, load balancing, scalability, fault tolerance)
- to design algorithms suitable for parallel processing of big data
- to implement algorithms for processing big data in an efficient and scalable manner
Course Contents
• Modeling of big data, definitions, the 6Vs (Volume, Variety, Velocity, Veracity, Validity and Volatility), modeling techniques related to big data, requirements for big data management platforms, the process of big data analysis, challenges related to big data.
• Data storage in cloud computing infrastructures, columnar file storage formats, data compression, selective access to subsets of records (a columnar-storage sketch follows this list).
• Distributed File Systems (DFS), the Hadoop distributed file system, the concept of HDFS blocks, increased availability through replication, characteristic advantages of HDFS.
• Principles of distributed and parallel data management, local and global indexes, partitioning techniques (round-robin, hash-based, range-based; sketched after this list), distributed query processing, distributed query optimization, load balancing.
• The MapReduce programming model, map and reduce phases, designing MapReduce jobs, simple and more complex jobs, aggregations, efficient parallel processing, optimization.
• The Hadoop ecosystem for batch processing, implementing MapReduce jobs in Hadoop, programming with Java, programming with Python, creating custom data types (a word-count sketch follows this list).
• Managing data in distributed main memory, the concept of Resilient Distributed Datasets, immutability, Apache Spark (an RDD sketch follows this list).
• Introduction to Spark DataFrames, learning the DataFrame API, declarative processing of big data, the SparkSQL language, execution plans, optimization (a DataFrame/SparkSQL sketch follows this list).
• Real-time processing, basic data flow concepts, data stream management systems, stateful and stateless processing, windowing mechanisms, data stream processing (a windowed-aggregation sketch follows this list).
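As a hedged illustration of the columnar-storage bullet, the sketch below uses PyArrow and Parquet to write a small compressed file and then read back only selected columns and rows. The table contents and the file name sales.parquet are invented for the example and are not part of the course material.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny in-memory table (stand-in for a large dataset).
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["GR", "DE", "GR", "FR"],
    "amount": [10.5, 3.2, 7.8, 1.1],
})

# Columnar storage with compression.
pq.write_table(table, "sales.parquet", compression="snappy")

# Selective access: read only two columns and filter rows by a predicate,
# so the remaining columns and row groups need not be decoded.
subset = pq.read_table(
    "sales.parquet",
    columns=["country", "amount"],
    filters=[("country", "=", "GR")],
)
print(subset)
```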
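The partitioning bullet names round-robin, hash-based and range-based strategies. The plain-Python sketch below (illustrative only; record values, key functions and partition counts are arbitrary) shows how each strategy assigns records to partitions.

```python
from typing import Any, Dict, List

def round_robin_partition(records: List[Any], n: int) -> Dict[int, List[Any]]:
    """Cycle through partitions 0..n-1, one record at a time."""
    parts: Dict[int, List[Any]] = {i: [] for i in range(n)}
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def hash_partition(records: List[Any], n: int, key=lambda r: r) -> Dict[int, List[Any]]:
    """Send each record to hash(key) mod n; equal keys land in the same partition."""
    parts: Dict[int, List[Any]] = {i: [] for i in range(n)}
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def range_partition(records: List[Any], boundaries: List[Any], key=lambda r: r) -> Dict[int, List[Any]]:
    """Assign records to ranges defined by sorted split points, e.g. [10, 20]."""
    parts: Dict[int, List[Any]] = {i: [] for i in range(len(boundaries) + 1)}
    for rec in records:
        k = key(rec)
        idx = sum(1 for b in boundaries if k >= b)  # index of the range containing k
        parts[idx].append(rec)
    return parts

if __name__ == "__main__":
    data = [3, 27, 14, 8, 21, 35, 12]
    print(round_robin_partition(data, 3))
    print(hash_partition(data, 3))
    print(range_partition(data, [10, 20]))
```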
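The MapReduce bullets can be illustrated with the classic word count. The sketch below follows the Hadoop Streaming convention (mapper and reducer read lines from stdin and emit tab-separated key/value pairs); the file name wordcount.py is hypothetical and the script is a minimal example, not the course's reference solution.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: run with 'map' or 'reduce' as the argument."""
import sys

def mapper():
    # Map phase: emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: Hadoop delivers keys sorted, so equal words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A typical Hadoop Streaming run passes the two modes of this script as the -mapper and -reducer commands; the exact jar path and options depend on the installation.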
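For the Resilient Distributed Datasets bullet, a minimal PySpark sketch (assuming a local Spark installation): transformations only build the RDD lineage, and the computation runs when an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# RDDs are immutable, partitioned collections; transformations build a lineage
# graph and nothing executes until an action (collect, count, ...) is invoked.
lines = sc.parallelize(["big data", "data stream", "big stream"])
counts = (lines.flatMap(lambda l: l.split())      # transformation
               .map(lambda w: (w, 1))             # transformation
               .reduceByKey(lambda a, b: a + b))  # transformation (shuffle)

print(counts.collect())   # action: triggers the actual computation
spark.stop()
```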
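For the DataFrame/SparkSQL bullet, the sketch below expresses the same aggregation declaratively through the DataFrame API and through SQL; the column names and values are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("GR", 10.5), ("DE", 3.2), ("GR", 7.8)],
    ["country", "amount"],
)

# Declarative DataFrame API: the optimizer chooses the physical plan.
totals = df.groupBy("country").agg(F.sum("amount").alias("total"))
totals.explain()   # inspect the execution plan
totals.show()

# The same query expressed in SparkSQL.
df.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

spark.stop()
```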
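For the stream-processing bullet, a sketch of a stateful windowed count with Spark Structured Streaming. The built-in rate source is used only so the example is self-contained (a course exercise would more likely read from Kafka or a socket), and the window and watermark sizes are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("stream-demo").master("local[*]").getOrCreate()

# The rate source generates (timestamp, value) rows; it stands in for a real stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Stateful processing: count events per 10-second tumbling window,
# tolerating up to 30 seconds of late data via the watermark.
counts = (events
          .withWatermark("timestamp", "30 seconds")
          .groupBy(window(col("timestamp"), "10 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("update")      # emit updated window counts as they change
         .format("console")
         .start())
query.awaitTermination(30)          # run for about 30 seconds (demo only)
query.stop()
spark.stop()
```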
Suggested Bibliography
- Παπαδόπουλος Απόστολος, Καρακασίδης Αλέξανδρος, Κολωνιάρη Γεωργία, Γούναρης Αναστάσιος (2024): Τεχνολογίες επεξεργασίας και ανάλυσης μεγάλων δεδομένων [Big Data Processing and Analysis Technologies; in Greek]. ΚΑΛΛΙΠΟΣ: https://repository.kallipos.gr/handle/11419/14277
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman (2020): Mining of Massive Datasets, 3rd Edition. Cambridge University Press.
- Raghu Ramakrishnan and Johannes Gehrke (2012): Database Management Systems, 3rd Edition, McGraw-Hill. ISBN: 978-0072465631.
- M. Tamer Özsu, Patrick Valduriez (2011): Principles of Distributed Database Systems, 3rd Edition. Springer. ISBN: 978-1-4419-8833-1.
- Tom White (2015): Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th Edition. O’Reilly Media. ISBN: 9781491901632.
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (2015): Learning Spark: Lightning-fast big data analysis. O’Reilly Media. ISBN: 9781449358624.
- Martin Kleppmann (2017): Designing Data-Intensive Applications. O’Reilly Media. ISBN: 9781491903100.