Big Data, Small Languages, Scalable Systems

Course instructor: Yanif Ahmad
Class schedule: Wednesday and Friday, 1.30-3.00pm, Shaffer 304
Instructor office hours: Friday 3.00-5.00pm (or by email), Shaffer 200A


Course Description

Relational databases and SQL dominate the DBMS software industry and, to a lesser extent, the cutting edge of academic research into data management. However, there is a wide variety of applications and platforms where both the relational data model and the design and architecture of popular DBMS are increasingly inappropriate, as the scale and complexity of data, queries, resources, and modes of interaction exacerbate the inadequacies of today's DBMS.

This class will study the state of the art in domain-specific data management tools, with an emphasis on alternative data models, declarative querying and optimization, and the design of extremely scalable system architectures. This course will expose students to the growing categories of applications addressed by databases as they break out of the box of relational DBMS, and will develop students' understanding of how to exploit structure and features in program input data for efficient computation (via query processing).

This year, the course comprises three sections: popular data models and their implications for query processing at scale; novel architectural aspects, including the use of disruptive hardware such as GPUs; and applications of these data management techniques in other areas of computer science in both industry and academia.

The course workload includes presenting and leading an in-class discussion for two research papers from the list below, writing two short papers on any of the topics covered this semester, and a take-home midterm in a similar format on a predefined topic. The course also includes a project component, where students may work in small groups or individually on a semester-long topic to design and implement a query processing engine for a novel data management application, using existing open-source tools (e.g., Hadoop, GraphLab, Postgres, MongoDB) and research tools (e.g., K3, DBToaster). Students may pick an application of their choice after discussing it with the instructor, or choose one from a predetermined list. [Systems]

Prereq: CS 600.315/415 or equivalent.

Academic Conduct

All activities related to this course are subject to JHU's academic ethics and student conduct policies. Students are also expected to adhere to the Computer Science Academic Integrity Code.

Organization

This course has a discussion-oriented format to introduce students to broader techniques and applications of data management, building on their existing background and experiences with data management tools. Students will read the assigned material prior to the week's classes as preparation for engaging in an in-class discussion to be led by a student presenter. Students are expected to pick one topic area and present the two papers chosen for that area in two 50-minute lectures, one in the Wednesday class and one in the Friday class. To get the most out of the material, my suggestion is that presentations be done in the style of conference tutorials; that is, they should include introductory material on the topic and summarize the contributions of the assigned papers, rather than focus purely on the papers. Given 75-minute lecture slots, the remaining 25 minutes will be a class discussion and follow-up material by the instructor linking the week's materials to the remaining topics in the syllabus.

For homework assignments, students will write two short papers (3-6 pages) on topics of their choosing based on the areas covered in the course. Students should send the instructor a brief email indicating the titles of their two short papers. Each paper could be a brief survey of the topic area, an extended review of a paper presented in class, or an idea or extension related to a paper. The course midterm will be in a similar (take-home) format, except on a topic question determined by the instructor.

This course includes a significant project component, providing students with an opportunity to develop novel data and query models, and query processing engines, for fields that have not traditionally been considered database applications. I have provided a list of potential project topics on the course material page, although you may of course suggest your own topic and discuss the plans with me. Students may work in small groups of 2-3 or individually on the project, and should set the project's scope and goals appropriately. Projects should be demonstrable (that is, they should include an interactive implementation). You should prepare a brief proposal document outlining your plans for the project by October 5th, and conclude the project with an interactive presentation (20-25 minutes) at semester's end.

Grading

40% Project
25% Short papers
25% Presentation and discussion lead
15% Midterm

Syllabus


Reading List

Week | Topic | Presenter | Material
September 5 | No class
Yanif is out of town
September 12 | Welcome, SQL+DBMS intro | Yanif Ahmad | Course overview and background material (relational algebra, SQL, DBMS architecture refresher).

Section: Data models
September 19 | Sequences, streams | Yanif Ahmad | Botan et al.: SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems. PVLDB 3(1): 232-243 (2010)
W. Lam et al.: Muppet: MapReduce-Style Processing of Fast Data. PVLDB 5(12): 1814-1825 (2012)
September 26 | No class
Yanif is out of town
October 3 | Nested, NoSQL, NewSQL | Naveen, Yanif | A. Lakshman, P. Malik: Cassandra: a decentralized structured storage system. Op. Sys. Rev:44(2), 2010
A. Thomson et al.: Calvin: fast distributed transactions for partitioned database systems. SIGMOD 2012.
October 10 | Arrays, scientific data | Lakshmisha, Yanif | J. Buck et al.: SciHadoop: array-based query processing in Hadoop. SC 2011.
A. Seering et al.: Efficient Versioning for Scientific Array Databases. ICDE 2012: 1013-1024
Short paper 1 due.
October 17 | Graphs, recursion and constraints | Frank, Yanif | J. Mondal, A. Deshpande: Managing large dynamic graphs efficiently. SIGMOD 2012: 145-156
Liu et al.: Cologne: A Declarative Distributed Constraint Optimization Platform. PVLDB 5(8): 752-763 (2012)
Midterm out

Section: Architectures
October 24 | I/O | Yanif | **E. Nightingale et al.: Flat Datacenter Storage. OSDI 2012.
**B. Xie et al.: Characterizing Output Bottlenecks in a Supercomputer. SC 2012.
October 31 | Networking | Andong, Yanif | Corbett et al.: Spanner: Google's Globally-Distributed Database. OSDI 2012.
Zaharia et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
Midterm in
November 7 | GPUs | Nick, Vaibhav | Mudalige et al.: Designing OP2 for GPU Architectures. Journal of Par. and Dist. Computing, 2012.
**S. Lee, J. Vetter: Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing. SC 2012.

Section: Applications
November 14 | Large web data | Abhijeet, Raghu | S. Melnik et al.: Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3(1): 330-339 (2010)
A. Hall et al.: Processing a Trillion Cells per Mouse Click. PVLDB 5(11):1436-1446, 2012.
Short paper 2 due.
November 21 | No class
Thanksgiving
November 28 | Large-scale learning | Olivia, Aric | A. Smola, S. Narayanamurthy: An Architecture for Parallel Topic Models. PVLDB 3(1): 703-710 (2010)
X. Feng et al.: Towards a unified architecture for in-RDBMS analytics. SIGMOD 2012: 325-336
December 5 | Crowdsourcing | Svitlana, Debu | M. Franklin et al.: CrowdDB: answering queries with crowdsourcing. SIGMOD 2011: 61-72
Parameswaran et al.: CrowdScreen: algorithms for filtering data with humans. SIGMOD 2012: 361-372
Final projects due.


** = pending availability


Course material:

All course material will be managed through our Blackboard page.
Look here for course announcements, project ideas, and PDFs of the papers.

Past courses

Fall 2011 Fall 2010