Master of Science in Engineering in Computer Science

Facoltà di Ingegneria dell'Informazione, Informatica e Statistica
Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti

Sapienza Università di Roma

Large Scale Data Management

Section Big Data Management

2023/2024

prof. Domenico Lembo


Who this course is for. This 3-credit course is one of the sections of the Large Scale Data Management course of the Master of Science in Engineering in Computer Science at Sapienza Università di Roma.

Prerequisites. A good knowledge of the fundamentals of Programming Structures, Programming Languages, Databases (SQL, relational data model, Entity-Relationship data model, conceptual and logical database design) and Database systems.

Course goals. In one sentence, Big Data is data that exceeds the processing capacity of conventional database systems. In particular, Big Data applications deal with huge amounts of data, possibly collected from a huge number of data sources (volume), in highly heterogeneous formats (variety), and arriving at a very high rate (velocity). This scenario calls for new technologies, ranging from new data storage mechanisms to new computing frameworks. In this course we will look at several key technologies used in manipulating, storing, and analyzing big data. In particular, we will study architectures for data-intensive distributed applications and NoSQL storage solutions.

Lectures

Schedule

Slides

Slides are available on the classroom web page of the course.

To access the material you have to register with your institutional account.


Additional Material (suggested -- slides cover all topics in the course)


Exams

There are two modalities for the exam:

(1) Development of a small project. Students are strongly encouraged to propose their own project ideas. As a suggestion, they can refer to (and also select from) the following list of tools. A project on a tool consists, for example, of studying the logical data model(s) adopted by the tool, the native storage data structure it uses, and the query language it provides, and of highlighting further distinguishing features. A demonstration of the basic use of the tool through one or more examples is also required. The project presentation (possibly using slides) should last around 20 minutes, including the demo.

  1. key-value database tools
    1. Redis
    2. Riak
    3. Memcached
    4. Voldemort
  2. document database tools
    1. Couchbase
    2. MarkLogic (Enterprise NoSQL)
  3. column-family database tools
    1. Cassandra
    2. HBase
    3. Hypertable

Note: Projects of this kind can be developed individually or by groups of two students. In the latter case, the presentation should be split equally into two parts, one given by each member of the group, and the overall presentation time can be extended to 30-40 minutes.

The exam will consist of the project presentation, with possible additional questions on the topics covered by this section of the Large Scale Data Management course.

To have a project assigned, students must send an email to lembo@diag.uniroma1.it indicating the kind of project they wish to present (please do not start working on a project before it has been assigned to you).

(2) Article Presentation

The article presentation consists of preparing a 20-minute presentation on scientific papers assigned by the lecturer or proposed by students. Send an email to lembo@diag.uniroma1.it to ask for the assignment of papers to study as final work (please do not start studying a paper for the exam presentation before it has been assigned to you).

Note: Article presentations can only be carried out individually.

Note: Both project and paper presentations will preferably be held during office hours. Students are nevertheless required to send an email in advance to fix the exact date and time of their presentation.