CSCI4180 Introduction to Cloud Computing and Storage - CUHK Course Review

Term Taken: 2020 Fall

Remote Teaching due to COVID-19

Instructor: Prof. Lee Pak Ching Patrick

Grading Scheme

Programming Group Assignments x 3 (45%)
Final Exam (55%)

Topics Covered

Cloud Computing Models.
Hadoop: MapReduce, HDFS, YARN.
MapReduce Design Patterns: Pair, Stripe, Order-inversion, Secondary Sorting.
MapReduce Algorithms: Counting, Parallel Dijkstra, PageRank.
Key-value Stores: BigTable, HBase, NoSQL, Amazon Dynamo.
CAP Theorem.
Deduplication: Rabin Fingerprinting, Indexing, Security.
Facebook Haystack, f4.
ZooKeeper.
Tail Tolerance.

Review

This was a required course for CS majors who selected the networks/distributed systems/security stream. As the title implied, the two main focuses of the course were the “computing” and “storage” sides of cloud systems.

For the computing part, we were taught to write MapReduce programs using Hadoop. Even though a significant portion of the industry has since left Hadoop for Spark, I felt it was still very useful to study MapReduce so that we could see what problems Hadoop tried to solve and why Spark was better at solving certain problems. Furthermore, MapReduce programming also gave us some insight into programming with a distributed mindset.
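A minimal plain-Java sketch of the map, shuffle, and reduce flow, using word counting as the example. It deliberately avoids Hadoop's API and runs in a single process, so it only illustrates the programming model, not distribution or fault tolerance. The class name MiniMapReduce and its method names are hypothetical and are not taken from the course material.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the MapReduce model (not Hadoop code).
public class MiniMapReduce {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key (sorted, as a TreeMap).
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each key.
    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {cloud=1, data=2, moves=1, stores=1, the=3}
        System.out.println(run(List.of("the cloud stores the data", "the data moves")));
    }
}
```

In a real Hadoop job the map and reduce functions would run on different machines and the framework would perform the shuffle over the network. Here all three phases are ordinary method calls.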

The storage part was also very interesting. We went through the architectural designs of a variety of in-house solutions at Big-N companies, such as Google’s BigTable and GFS, Facebook’s f4 and Haystack, Dropbox’s deduplication system, and Amazon’s Dynamo. All of them were very impressive and inspiring. Even though the chances of actually participating in these extra-large-scale projects were slim, it was still fun to compare their design tradeoffs and understand how they solved the problems they encountered. It was like peeking into the black boxes of these tech giants and appreciating how a seemingly simple “click” on their websites involved a huge amount of intricately engineered infrastructure.

There were three group (2-3 people) programming assignments in this course. All were done in Java. The first two were MapReduce programming. They were pretty straightforward since most of the algorithms we had to implement were explicitly written in the lecture notes. What was left to do was figuring out how to wire them together by calling Hadoop’s API. The third assignment was a little trickier (yet it was the one I enjoyed the most): we had to build a simplified file deduplication backend capable of uploading chunks to Microsoft Azure blob storage. By doing this project, we got to explore how Dropbox reduces its storage significantly by applying the simple trick of deduplication. Also worth mentioning: during the demo of each homework, the running times of some selected tasks were recorded. The top 3 groups with the shortest running times were granted an additional 5 points as a bonus (only groups which passed all the test cases were counted, of course). Overall, the assignments were quite interesting and not too time-consuming as long as you knew Java.
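A rough sketch of the deduplication idea, under several simplifying assumptions: it uses fixed-size chunks rather than the Rabin-fingerprint-based chunking listed in the course topics, identifies chunks by SHA-256, and keeps everything in an in-memory map that merely stands in for a blob store such as Azure. The class MiniDedupStore and its methods are hypothetical and do not reflect the actual assignment specification.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory deduplicating store (fixed-size chunks, SHA-256 fingerprints).
public class MiniDedupStore {
    private final int chunkSize;
    // fingerprint -> chunk bytes; stands in for the remote blob storage
    private final Map<String, byte[]> chunkStore = new HashMap<>();
    // file name -> ordered list of chunk fingerprints (the file "recipe")
    private final Map<String, List<String>> recipes = new HashMap<>();

    public MiniDedupStore(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Hex-encoded SHA-256 digest used as the chunk identifier.
    public static String fingerprint(byte[] chunk) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(chunk)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Splits data into chunks and stores only unseen ones; returns how many were new.
    public int upload(String name, byte[] data) {
        List<String> recipe = new ArrayList<>();
        int newChunks = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            byte[] chunk = Arrays.copyOfRange(data, off, Math.min(off + chunkSize, data.length));
            String fp = fingerprint(chunk);
            if (chunkStore.putIfAbsent(fp, chunk) == null) {
                newChunks++; // first time this content has been seen
            }
            recipe.add(fp);
        }
        recipes.put(name, recipe);
        return newChunks;
    }

    // Rebuilds a file by concatenating its chunks in recipe order.
    public byte[] download(String name) {
        List<String> recipe = recipes.get(name);
        if (recipe == null) {
            throw new IllegalArgumentException("unknown file: " + name);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String fp : recipe) {
            byte[] chunk = chunkStore.get(fp);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public int storedChunks() {
        return chunkStore.size();
    }

    public static void main(String[] args) {
        MiniDedupStore store = new MiniDedupStore(4);
        store.upload("a.txt", "AAAABBBBCCCC".getBytes(StandardCharsets.UTF_8));
        // Only the last chunk of the second file differs, so one new chunk is stored.
        int added = store.upload("b.txt", "AAAABBBBDDDD".getBytes(StandardCharsets.UTF_8));
        System.out.println(added + " new chunk(s), " + store.storedChunks() + " stored in total");
    }
}
```

Fixed-size chunking keeps the sketch short, but inserting a single byte near the start of a file shifts every later chunk boundary and defeats the deduplication. Content-defined chunking, for example with Rabin fingerprints, is the usual way around that.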

The final exam was very similar every year. So even though the amount of material in this course was quite large, it should be easy to get a good score on the exam by working through past papers.

Highly recommended.
