CMSC 23310 - Advanced Distributed Systems

Course Staff

Instructor

Ian Foster (foster@uchicago.edu)

Office hours: By appointment.

TA

Rose Pierce (rosepierce@uchicago.edu)

Office hours: TBD

Course Description

In recent years, large distributed systems have taken a prominent role not just in scientific inquiry, but also in our daily lives. When we perform a search on Google, stream content from Netflix, place an order on Amazon, or catch up on the latest comings-and-goings on Facebook, our seemingly minute requests are processed by complex systems that sometimes include hundreds of thousands of computers, connected by both local and wide area networks.

Recent papers in the field of Distributed Systems have described several solutions (such as BigTable, MapReduce, Spanner, Raft, Dynamo, Cassandra) for managing large-scale data and computation. However, building and using these systems poses a number of more fundamental challenges: How do we keep the system operating correctly even when individual machines fail? How do we ensure that all the machines have a consistent view of the system’s state? (and how do we ensure this in the presence of failures?) How can we determine the order of events in a system where we can’t assume a single global clock?

Many of these fundamental problems were identified and solved over the course of several decades, starting in the 1970’s. In this course, we will engage in reading and discussing seminal work in Distributed Systems from the last 35 years to (1) identify the fundamental issues raised in this earlier work, (2) relate those issues to current research problems, and (3) evaluate and compare the solutions proposed in both early and recent work. During this course, students will also implement a distributed system that requires them to manage distribute resources and evaluate whether the resulting system has certain properties, such as reliability, scalability, etc.

At the end of the quarter, students will be able to:

  1. Identify the research questions posed in a scholarly paper and the solutions proposed in that paper.

  2. Identify the main contributions and conclusions in a scholarly paper, and determine whether they are well supported by evaluating and criticizing the arguments, proofs, or experimental results in that paper.

  3. Compare and contrast different distributed systems for managing large-scale data and computation.

  4. Evaluate whether an implementation of a distributed system is reliable, fault-tolerant, scalable, and/or highly available.

A B+ or higher in CMSC 23300 (Networks and Distributed Systems) is a prerequisite for this course. Students can petition to have this requirement waived, as long as they have taken at least one other 200-level CS systems course.

For more details, please see the syllabus.