Big Data Programming

Learning Outcomes

The course teaches data management methods and programming techniques suitable for big data. Its goal is to familiarize students with solving problems that require the processing and storage of different types of data (structured, semi-structured and unstructured), the efficient programming of parallel processing solutions, and the design and implementation of scalable algorithms that operate on distributed data collections. The course covers modern big data processing technologies and systems.

Upon successful completion of the course, students will be able:

  • to understand the basic concepts related to big data
  • to understand the problems associated with managing large collections of diverse data
  • to solve practical problems of modeling, storing, processing and analyzing big data
  • to use tools and programming techniques for big data
  • to understand the principles of parallel processing (data sharing, load balancing, scalability, fault tolerance)
  • to design algorithms suitable for parallel processing of big data
  • to implement algorithms for processing big data in an efficient and scalable manner

General Competences

  • Search for, analysis and synthesis of data and information using appropriate technologies
  • Adapting to new situations
  • Decision-making
  • Individual/independent work
  • Critical thinking
  • Introduction of innovative research
  • Development of free, creative and inductive thinking

Course Contents

  • Modeling of big data, definitions, the 6 Vs (Volume, Variety, Velocity, Veracity, Validity and Volatility), modeling techniques related to big data, requirements for big data management platforms, the process of big data analysis, challenges related to big data.
  • Data storage in cloud computing infrastructures, columnar file storage formats, data compression, selective access to subsets of records.
  • Distributed File Systems (DFS), the Hadoop distributed file system, the concept of HDFS blocks, increased availability through replication, characteristic advantages of HDFS.
  • Principles of distributed and parallel data management, local and global indexes, partitioning techniques (round-robin, hash-based, range-based), distributed query processing, distributed query optimization, load balancing.
  • The MapReduce programming model, map and reduce phases, designing MapReduce jobs, simple and more complex jobs, aggregations, efficient parallel processing, optimization.
  • The Hadoop ecosystem for batch processing, implementing MapReduce jobs in Hadoop, programming with Java, programming with Python, creating custom data types.
  • Managing data in distributed main memory, the concept of Resilient Distributed Datasets, immutability, Apache Spark.
  • Introduction to Spark DataFrames, learning the DataFrame API, declarative processing of big data, the Spark SQL language, execution plans, optimization.
  • Real-time processing, basic data flow concepts, data stream management systems, stateful and stateless processing, windowing mechanisms, data stream processing.
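
As an illustration of the map and reduce phases listed above, the following is a minimal single-process sketch of a word-count job in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and the shuffle step that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the list of values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big ideas", "data streams"]  # illustrative input only
    print(word_count(sample))
```

Because each call to `map_phase` depends on a single line and each call to `reduce_phase` on a single key, both phases could in principle run in parallel on separate partitions, which is the property the model relies on.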

Suggested Bibliography

  • Παπαδόπουλος Απόστολος, Καρακασίδης Αλέξανδρος, Κολωνιάρη Γεωργία, Γούναρης Αναστάσιος (2024): Τεχνολογίες επεξεργασίας και ανάλυσης μεγάλων δεδομένων [Big data processing and analysis technologies; in Greek]. Kallipos: https://repository.kallipos.gr/handle/11419/14277
  • Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (2020): Mining of Massive Datasets, 3rd Edition, Cambridge University Press.
  • Raghu Ramakrishnan and Johannes Gehrke (2012): Database Management Systems, 3rd Edition, McGraw-Hill. ISBN: 978-0072465631.
  • M. Tamer Özsu, Patrick Valduriez (2011): Principles of Distributed Database Systems, 3rd Edition. Springer. ISBN: 978-1-4419-8833-1.
  • Tom White (2015): Hadoop: The Definitive Guide – Storage and Analysis at Internet Scale, 4th Edition. O’Reilly Media. ISBN: 9781491901632.
  • Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia (2015): Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media. ISBN: 9781449358624.
  • Martin Kleppmann (2017): Designing Data-Intensive Applications. O’Reilly Media. ISBN: 9781491903100.

Structured Representation of Information

Learning Outcomes

The course covers the standard technologies and languages for modeling and representing data and metadata on the web and in web services, and how they are applied in practice through code development in XML, XSL and XML Schema.

Upon successful completion of the course, students will be able to:

  • Explain the basic technologies and languages for modeling and representing data and metadata on the web and in web services.
  • Design and develop programs using XML, XSL and XML Schema.
  • Evaluate metadata models and decide whether they meet the given requirements.

Course Contents

  • Introduction to markup languages and the Semantic Web
  • Introduction to XML, basic structure of XML documents
  • Valid XML documents / Use of Document Type Definition (DTD)
  • Presentation of XML documents using CSS
  • XML namespaces
  • Presentation of XML documents using data binding
  • Presentation of XML documents using Document Object Model (DOM) scripts
  • Transformation and presentation of XML documents using XSLT/XSL
  • Modeling of XML documents using XML Schema
  • XML applications
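
To give a flavour of the topics above, the following short Python sketch parses a small XML document that uses a default namespace and extracts element content with the standard library module `xml.etree.ElementTree`. The document, the namespace URI `http://example.org/catalog` and the function name `book_titles` are made up for illustration; the course itself may use other tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with a default namespace.
CATALOG_XML = """
<catalog xmlns="http://example.org/catalog">
  <book id="b1"><title>XML Basics</title><price>20.50</price></book>
  <book id="b2"><title>XSLT in Practice</title><price>35.00</price></book>
</catalog>
"""

# Prefix-to-URI mapping used in the search expressions below.
NS = {"c": "http://example.org/catalog"}


def book_titles(xml_text):
    """Return the text of every <title> inside a <book> element."""
    root = ET.fromstring(xml_text.strip())
    return [book.findtext("c:title", namespaces=NS)
            for book in root.findall("c:book", NS)]


if __name__ == "__main__":
    print(book_titles(CATALOG_XML))
```

Note that the elements must be addressed through the namespace mapping: searching for plain `book` would find nothing, because every element in the sample belongs to the default namespace.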

Recommended Readings

  • “XML Guide”, Edition: 1st, Author: S. Holzner, Publisher: M. Gkiourdas, 2009 (1st Book)
  • “XML step by step”, Author: M. J. Young, Publisher: Kleidarithmos Ltd, 2011 (2nd Book)
  • Notes and course slides

Business Process Management

Learning Outcomes

The objective of this course is to present fundamental principles of Business Process Management (BPM) and to study various methods and techniques for analyzing, modeling, automating, executing and optimizing business processes. The course will incorporate a laboratory component with well-known BPM software tools that allow students to practice some of the principles addressed.

Upon successful completion of this course, students will be able to:

  • Create business process models using BPMN-based modelling tools
  • Execute business processes using Business Process Management Systems
  • Analyze the performance of existing business processes and improve those that fall short of given criteria
  • Create business process management strategies and business process implementation plans within organizations

Course Contents

  1. Business process definition, intra- and inter-organizational processes. Process-oriented organizations. Building business models of processes. Virtual enterprises. Business processes and workflows.
  2. Process analysis techniques. Qualitative process analysis (e.g. Pareto analysis, value-added analysis, root-cause analysis). Quantitative process analysis (e.g. queuing analysis, simulation). Performance metrics (time, cost, quality).
  3. BPM life cycle. Discover, analyze, model, monitor, map, simulate, deploy. Business Process Reengineering (BPR) and Business Process Improvement (BPI) methodologies. Business process modeling tools.
  4. The BPMN standard for business process modelling.
  5. Business process automation. Conceptual and executable process models.
  6. Business Process Management Systems (BPMS) (e.g. structure, architecture, standards).
  7. Process and activity life cycles. Workflow-based applications.
  8. Business processes and workflows, workflow categories, workflow dimensions, workflow management, workflow functional requirements, workflow specifications and execution languages.
  9. Workflow management using a specific BPMS software tool.
  10. Process analytics. Metrics for evaluating business process performance. Monitoring of standard metrics and process-specific, user-defined metrics.
  11. BPM methodologies (e.g. Six Sigma, Lean).
  12. Service-oriented and process-oriented information systems.
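
As a small worked example of the quantitative process analysis mentioned in item 2, the sketch below computes standard steady-state metrics for a single-server activity. It assumes the simplest textbook queuing model (M/M/1: random arrivals, exponentially distributed service times, one resource), which the syllabus does not prescribe; the function name `mm1_metrics` and the sample rates are illustrative only.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue (rates per unit of time).

    Assumes a single resource and exponential inter-arrival and service times.
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrivals must be slower than service")

    utilization = arrival_rate / service_rate          # fraction of time the resource is busy
    avg_in_system = utilization / (1 - utilization)    # average number of cases waiting or in service
    avg_cycle_time = 1 / (service_rate - arrival_rate) # average time a case spends in the activity
    return {
        "utilization": utilization,
        "avg_in_system": avg_in_system,
        "avg_cycle_time": avg_cycle_time,
    }


if __name__ == "__main__":
    # Hypothetical activity: 8 cases arrive per hour, 10 can be handled per hour.
    print(mm1_metrics(arrival_rate=8, service_rate=10))
```

With these sample rates the resource is busy 80% of the time and a case spends half an hour in the activity on average; the figures also satisfy Little's law (average number in system = arrival rate × average cycle time).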

Suggested Bibliography

  • John Jeston and Johan Nelis (2008): Business Process Management, Second Edition: Practical Guidelines to Successful Implementations, Butterworth-Heinemann, Boston, ISBN: 0750669217.
  • Artie Mahal (2010): How Work Gets Done: Business Process Management, Basics and Beyond, Technics Publications, New Jersey, ISBN: 193550407.
  • Mathias Weske (2010): Business Process Management: Concepts, Languages, Architectures, Springer, New York, ISBN: 3642092640.
  • Simha Magal and Jeffry Word (2009): Essentials of Business Processes and Information Systems, Wiley, New York, ISBN: 0470418540.
  • Howard Smith and Peter Fingar (2003): Business Process Management: The Third Wave, Meghan-Kiffer Press, ISBN: 0929652339.
  • Mark McDonald (2010): Improving Business Processes, Harvard Business Review Press, Boston, ISBN: 142212973.
  • Business Process Management Journal, Emerald.
  • International Journal of Business Process Integration and Management, Inderscience Publishers.

Database Systems

Learning Outcomes

Upon successful completion of the course, students will be able:

  • to apply the appropriate techniques for programming and managing database systems.
  • to describe the basic structures for data storage and organization.
  • to apply query processing, query optimization and transaction management mechanisms.
  • to understand the mechanisms that preserve system integrity when multiple concurrent users access the same data, as well as the methods for recovering a database after a failure.

Course Contents

  • Introduction to Database Management Systems (DBMSs). Fundamental concepts of DBMSs, database applications, overview of data models.
  • Data storage and file organization.
  • Query processing methods.
  • Query optimization methods.
  • Transaction management: characteristics of a transaction management system.
  • Concurrency Control.
  • Database recovery methods.
  • Parallel and distributed databases: design, query processing and transaction management in distributed systems.
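
To illustrate the transaction management topic above, the following minimal sketch uses Python's built-in `sqlite3` module to show atomicity: either both updates of a transfer are applied, or neither is. The `accounts` table, the account names and the helper functions are hypothetical and chosen only for this example; a production DBMS would add concurrency control and durable recovery on top of this.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (remaining,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if remaining < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))


if __name__ == "__main__":
    conn = make_bank()
    transfer(conn, "alice", "bob", 30)
    print(balances(conn))
    try:
        transfer(conn, "alice", "bob", 500)  # fails and leaves balances untouched
    except ValueError as exc:
        print("rolled back:", exc)
    print(balances(conn))
```

The failed transfer has already debited the source account when the check fails, so the unchanged balances afterwards show that the rollback undid the partial update.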

Recommended Readings

  • Ramakrishnan R. & Gehrke J. (2002): Database Management Systems (3rd Edition), McGraw Hill.
  • Elmasri R. & Navathe S.B. (2007): Fundamentals of Database Systems (5th Edition), Addison-Wesley.