Multidisciplinary Research and Education on Big Data + High-Performance Computing + Atmospheric Sciences

Part of NSF Initiative on Workforce Development for Cyberinfrastructure (CyberTraining)

Course Syllabus


We propose to design a “Big Data + HPC + Atmospheric Sciences” course addressing the four challenges above through the following innovative approaches:
  1. It will teach students in Atmospheric Sciences how to implement and run parallel and big data programs at an HPC facility.
  2. It will teach students in Computing and Applied Mathematics how to apply their knowledge to solve Atmospheric Sciences challenges.
  3. It will provide distinct learning outcomes and homework tailored to the backgrounds and interests of students in the different disciplines.
  4. It will provide team-based frontier research projects, where each team is composed of students from different disciplines so they can collaborate and contribute from their own research perspectives.

Our proposed 15-module multidisciplinary course includes
  1. a customized course design for three disciplines, with commonalities and differences.
  2. adoption of data and computing techniques for Atmospheric Sciences (three to four modules each for Data Science, HPC, and Atmospheric Sciences).
  3. identification of open challenges (including related open data) that can benefit from advanced CI resources and techniques.
  4. a five-week team-based project on frontier research challenges.
  5. open-source CI software implementation.
  6. publications from the designed research projects.

If taught during a regular semester, the workload is equivalent to that of a three-credit course.

Course Structure

Module | Topic | Goal
1 | Introduction to Python/C, Linux, and the HPC environment | Run their own jobs on HPC.
2 | Numerical methods for partial differential equations (PDEs) | Model problems as PDEs and solve them with numerical methods.
3 | Message Passing Interface (MPI) | Write MPI jobs and conduct performance studies.
4 | Introduction to Data Science | Know basic tasks and techniques of Data Science.
5 | Basics of Big Data | Understand the basics of Big Data and demo programs.
6 | Big Data systems: Hadoop/Spark | Write Hadoop/Spark jobs and run them on HPC.
7 | Basics of Machine Learning | Write a machine learning program using Spark MLlib.
8 | Basics of earth-atmosphere radiative energy balance and global warming | Understand basic concepts and principles of radiative energy balance and global warming.
9 | Basics of radiative transfer simulation framework | Understand the basic physics underlying the transport of radiation in the atmosphere.
10 | GCM simulation and satellite observations | Understand the importance of GCMs and satellite remote sensing.
11 | Project introduction and assignment | Each interdisciplinary team will be assigned one project.
12-14 | Project progress reports from each team and feedback | 20-minute report from each team + Q&A + rating.
15 | Final project presentation | Report, software, and a final presentation from each team.


Module 1:
Introduction to Python/C, Linux, and the HPC environment. The first module explains the overall structure of the program and the basic knowledge it requires. It briefly reviews a programming language such as Python or C, and introduces the hardware architecture, available software, and basic usage of the UMBC HPCF environment.

Module 2:
Numerical Methods for Partial Differential Equations. This module will explain the basics of partial differential equations (PDEs), which are commonly used in physical models. It will discuss the numerical solution of PDEs, a major driving force behind research in many other fields such as numerical linear algebra, scientific computing, and the development of parallel computers. It will cover the three basic categories of PDEs and their mathematical properties, with examples, and will discuss two large classes of methods: finite difference and finite element methods.
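
For illustration, the core idea fits in a short Python sketch (the grid sizes, time step, and initial condition below are chosen arbitrarily for demonstration) that solves the 1D heat equation u_t = alpha * u_xx with an explicit finite difference scheme:

    import numpy as np

    # 1D heat equation u_t = alpha * u_xx on [0, L], explicit finite differences
    nx, nt = 51, 500                 # grid points in space, steps in time
    alpha, L, T = 1.0, 1.0, 0.1      # diffusivity, domain length, final time
    dx, dt = L / (nx - 1), T / nt
    r = alpha * dt / dx**2           # explicit scheme is stable only for r <= 0.5
    assert r <= 0.5

    x = np.linspace(0, L, nx)
    u = np.sin(np.pi * x)            # initial condition; u = 0 at both boundaries
    for _ in range(nt):
        u[1:-1] += r * (u[2:] - 2 * u[1:-1] + u[:-2])

    # Exact solution decays by exp(-pi**2 * alpha * T) ~ 0.373
    print(u.max())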

Module 3:
Message Passing Interface (MPI). This module will explain how to write MPI programs; MPI is one of the most common approaches to building portable and scalable parallel scientific applications. It will cover basic MPI commands such as MPI_Send and MPI_Recv, and collective communication commands such as MPI_Bcast, MPI_Reduce/MPI_Allreduce, and MPI_Gather/MPI_Scatter. It will also explain how to write MPI programs in both C and Python (through mpi4py).
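
As a preview, a minimal mpi4py sketch (the message contents and tag are arbitrary) combines point-to-point and collective communication:

    # Run with, e.g.: mpiexec -n 4 python mpi_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Point-to-point: rank 0 sends a Python object to rank 1
    if rank == 0 and size > 1:
        comm.send({"greeting": "hello from rank 0"}, dest=1, tag=11)
    elif rank == 1:
        msg = comm.recv(source=0, tag=11)
        print("rank 1 received:", msg)

    # Collective: sum every rank's number across all processes
    total = comm.allreduce(rank, op=MPI.SUM)
    print(f"rank {rank}: sum of all ranks = {total}")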

Module 4:
Introduction to Data Science. This module will explain the basic concepts of Data Science, including the generic lifecycle and the different stages of data analytics, such as acquisition, cleaning/preprocessing, integration/aggregation, analysis/modeling, and interpretation. It will cover the basics of descriptive statistics, graphical displays of data summaries, and the basics of probability theory (including Bayes' theorem).
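
As one concrete example, Bayes' theorem takes only a few lines of Python (the prevalence and test-accuracy numbers below are invented for illustration):

    # Bayes' theorem: P(disease | positive test)
    p_d = 0.01            # prior: disease prevalence
    p_pos_d = 0.95        # sensitivity, P(+ | disease)
    p_pos_nd = 0.05       # false-positive rate, P(+ | no disease)

    p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # law of total probability
    p_d_pos = p_pos_d * p_d / p_pos                # posterior
    print(f"P(disease | +) = {p_d_pos:.3f}")       # ~0.161, despite a 95% accurate test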

Module 5:
Basics of Big Data. This module will explain the basics of Big Data, including its 5V characteristics. It starts with the challenges and bottlenecks many applications face when dealing with large volumes of data. It will then introduce the basics of distributed file systems and why they are needed. It will cover core Big Data concepts and techniques: data partitioning, data parallelization, key-value pairs, functional programming, and MapReduce.
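
To make the MapReduce model concrete, its map, shuffle, and reduce phases can be mimicked on a single machine in plain Python (the function names and toy input are ours, for illustration only):

    from collections import defaultdict

    def mapper(line):
        # Map phase: emit (word, 1) key-value pairs
        for word in line.lower().split():
            yield word, 1

    lines = ["big data needs data partitioning", "data parallelism scales"]

    # Shuffle phase: group intermediate pairs by key
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)

    # Reduce phase: sum the counts for each key
    result = {word: sum(counts) for word, counts in groups.items()}
    print(result)   # {'big': 1, 'data': 3, ...}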

Module 6:
Big Data systems: Hadoop/Spark. This module will cover how to use two popular Big Data systems, namely Hadoop and Spark. It will explain how the Hadoop Distributed File System (HDFS) achieves data partitioning and fault tolerance, and how Hadoop/Spark handle cluster management and job scheduling. For Spark, it will explain resilient distributed datasets (RDDs), RDD transformations (map, join, cogroup, etc.) and actions (count, collect, foreach, etc.), and lazy evaluation.
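
For example, a minimal PySpark word count (the application name and toy input are arbitrary) shows transformations being deferred until an action forces evaluation:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")
    lines = sc.parallelize(["big data on hpc", "spark uses lazy evaluation"])

    # Transformations are lazy: nothing executes yet
    counts = (lines.flatMap(lambda s: s.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Actions trigger the actual computation
    print(counts.collect())
    sc.stop()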

Module 7:
Basics of Machine Learning. This module will explain the main lifecycle of machine learning (training, testing, applying) and its main types (supervised and unsupervised learning). Major techniques to be covered include inferential statistics, feature selection, regression, correlation, clustering, and classification. It will also explain how to build Big Data machine learning programs through Spark MLlib.
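
As a sketch of what such an exercise might look like (the toy data and choice of logistic regression below are ours, not a prescribed assignment), a classifier can be trained with Spark's DataFrame-based MLlib API in a few lines:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Toy training data: two features and a binary label
    df = spark.createDataFrame(
        [(0.0, 1.1, 0.2), (1.0, 8.5, 9.1), (0.0, 0.7, 1.0), (1.0, 9.2, 7.8)],
        ["label", "x1", "x2"])

    # Assemble raw columns into the feature vector MLlib expects
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(maxIter=10).fit(train)
    model.transform(train).select("label", "prediction").show()
    spark.stop()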

Module 8:
Basics of earth-atmosphere radiative energy balance and global warming. This module will explain the basic concepts and principles that control the radiative energy balance of the earth-atmosphere system and its implications for climate. The module will start with the fundamental physics, such as black-body radiation, followed by the zero-order radiative energy balance between incoming solar radiation and outgoing terrestrial longwave radiation. The module will end with a discussion of the roles that greenhouse gases, aerosols, and clouds play in the radiative energy budget.
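
The zero-order balance itself makes a compact worked example: equating absorbed solar radiation S(1 - albedo)/4 with emitted longwave radiation sigma * T^4 yields the well-known effective emission temperature of about 255 K (standard textbook values are used below):

    # Zero-order balance: S * (1 - albedo) / 4 = sigma * T_e**4
    S = 1361.0        # solar constant, W m^-2
    albedo = 0.3      # planetary albedo
    sigma = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

    T_e = (S * (1 - albedo) / (4 * sigma)) ** 0.25
    print(f"Effective emission temperature: {T_e:.1f} K")   # ~255 K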

Module 9:
Basics of the radiative transfer simulation framework. Following the previous module, this module will introduce the fundamental physical principles that control the transport of radiation (i.e., visible and infrared light) in our atmosphere. The module will also introduce the Monte Carlo method and its application to radiative transfer.
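
As a minimal illustration of the Monte Carlo approach (a homogeneous, purely absorbing layer is assumed here, a deliberate simplification of real radiative transfer), the direct transmittance of a layer of optical depth tau can be estimated by sampling photon path lengths:

    import numpy as np

    rng = np.random.default_rng(0)
    tau, n = 1.0, 100_000

    # Beer-Lambert law: photon optical path lengths are exponentially distributed
    paths = rng.exponential(scale=1.0, size=n)

    # Photons whose free path exceeds tau pass straight through the layer
    transmittance = np.mean(paths > tau)
    print(transmittance, np.exp(-tau))   # Monte Carlo estimate vs. exact ~0.368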

Module 10:
GCM simulation and satellite observations. This module will start with an introduction to the basic concepts and principles of numerical climate simulations, followed by an explanation of why evaluating climate simulations is important and why satellite remote sensing products are invaluable for climate model evaluation. The basic concepts and principles underlying satellite remote sensing will also be introduced in this module.

Module 11:
Project introduction and assignment. This module will explain the available research projects to be conducted in the following five weeks (see below for possible projects). For each project, it will cover the required techniques, suggested phases and major tasks, expected outputs, output evaluation metrics, and challenges for each discipline. Each team will be assigned one project to work on.

Modules 12-14:
Project progress reports from each team, with feedback from the instructors. These three modules will consist of weekly project progress updates and discussions. Since each team has three members, every member will present at least one report. All instructors and the other teams will discuss the progress, perform peer review, provide feedback, and give ratings.

Module 15:
Final project presentation. The final module will consist of the final project presentations and delivery of the final CI software and technical report. Each team will give a talk covering the problem to be solved, the approaches taken, a demonstration of the developed software, the experiments and results, and the contributions of each member. All instructors and the other teams will provide feedback, ratings, and suggestions for future work.