The field of Data Science deals with the theories, methodologies and tools for applying statistical concepts and computational techniques to data analysis problems in science, engineering, medicine, business, and other areas. The objective is to inspect, clean, transform and model data in order to discover useful information, suggest conclusions and support decision-making. It is an emerging field that has become an indispensable component of almost every discipline of today’s science and technology.

Data Science is a highly interdisciplinary field. Its methodologies are mostly derived from statistical theory. The computational algorithms that implement these statistical methodologies are based on numerical computation and optimization, and are often executed on large-scale hardware platforms composed of many computing units and storage devices. These kinds of data analyses can be applied to a wide range of specific problems across the natural and social sciences and serve as the foundation for artificial intelligence. Data Science can be extensively applied to economics, biology, health care, quantitative social science including global health and environmental science, and humanities (e.g., digital media). Numerous new applications are being discovered, and established techniques are being applied in new ways to solve emerging problems. Meanwhile, a variety of career opportunities are open to students with appropriate training in interdisciplinary data science.

Major Requirements

(Not every course listed is offered every term, and the course list will be updated periodically. Please refer to the online Course Catalog for courses offered in 2021-2022.)

Divisional Foundation Courses

Choose one from the following 2 Math courses

MATH 101
This course offers an introduction to Calculus, a subject that is the foundation for a large part of modern mathematics and has countless applications across the sciences and beyond. The course covers the fundamental Calculus concepts (limits, continuity, differentiation, integration) and explores related applications. The treatment of these concepts assumes no prior knowledge of Calculus. Recommended for students who have not had a previous (high-school level) Calculus course. Students who have had such a Calculus course are recommended to take MATH 105 instead.
MATH 105
Calculus is the foundation for a large part of modern mathematics and has countless applications across the sciences and beyond. This course covers the fundamental Calculus concepts (limits, continuity, differentiation, integration) and explores related applications. The treatment of these topics is rigorous and it involves basic principles of mathematical logic and epsilon-delta language. Recommended for students who have had a previous (high-school level) Calculus course. Not open to students who have credit for MATH 101.

And complete the following courses

PHYS 121
This course is about how to view the world from the perspective of classical mechanics, based on an understanding of the core concepts and theoretical laws. As a science foundation course, it helps students appreciate the elegant simplicity of the universal laws governing the complex systems surrounding us, and it teaches an important approach to identifying, formulating, and solving problems encountered in the physical world. The course begins with the core concepts of classical mechanics (time, space, mass, force, work, energy, momentum) and the physical laws that link them with each other. Students first learn Newton’s laws and the universal law of gravitation as they apply to point mass systems. Subsequently, basic concepts of oscillation and waves, rigid body motion, fluid mechanics, thermodynamics and statistical mechanics are introduced, illustrated with real-life examples (e.g., physics of cooking, biosphere as a thermal engine) to help students integrate different science foundation courses by themselves. While no previous knowledge of Physics is required, some background is advantageous.
CHEM 110
With an integrated approach, this course examines basic concepts and fundamental principles in chemistry based on the laws of physics. The course starts with an introduction to the static structures of atoms, molecules and matter including life itself, followed by an exploration of the dynamical and collective processes during chemical reactions. It explains how atoms, the basic building blocks of matter, interact with each other and construct the world around us, how subatomic electrons modulate the chemical properties of elements, and how the rearrangement of atoms during chemical reactions gives rise to astonishing phenomena in nature. Centered on topics in chemistry, this course not only prepares students for upper-level disciplinary courses, but also helps students develop an interdisciplinary molecular perspective, which allows them to tackle problems in various fields such as condensed matter physics, molecular biology, medicine, materials science and environmental science. While no previous knowledge is required, some background is advantageous. Not open to students who have credit for both INTGSCI 101 and 102 or CHEM 120.
BIOL 110
Integrated Science-Biology is built around five themes that describe properties of life and recur throughout the course: Organization (Structure and Function), Cycling of Energy and Matter, Information (Genetic Variation), Homeostasis (Interactions), and Evolution. These themes are unified under the organizing principles of the scientific method: formulating hypotheses and testing them with experiments. Students in Integrated Science-Biology will develop an understanding of key concepts in the context of their connections with chemistry and physics. While no previous knowledge is required, some background is advantageous.

Interdisciplinary Courses

This course covers maximum likelihood estimation, linear discriminant analysis, logistic regression, support vector machine, decision tree, linear regression, Bayesian inference, unsupervised learning, and semi-supervised learning. Students are not allowed to take both MATH 405 and STATS 302 because of the content overlap. Students who are planning to major in Data Science should take STATS 302.
This course covers statistical inference, parametric methods, sparsity, nonparametric methods, learning theory, kernel methods, computational algorithms and advanced learning topics.
This course introduces the principles and methodologies for data acquisition and visualization, along with tools and techniques used to clean and process data for visual analysis. It also covers the practical software tools and languages such as Tableau, OpenRefine and Python/Matlab.
This course covers interdisciplinary applications of data analysis for social science, behavioral modeling, health care, financial modeling, advanced manufacturing, etc. Students are expected to solve a number of practical problems by implementing data algorithms with R during their course projects.
This course covers data and representations, functions, conditions, loops, strings, lists, sets, maps, hash tables, trees, stacks, graphs, object-oriented programming, programming interface and software engineering.

Disciplinary Courses

This course covers probability models, random variables with discrete and continuous distributions, independence, joint distributions, conditional distributions, expectations, functions of random variables, central limit theorem, stochastic processes, random walks, and Markov chains. COMPSCI 201 or COMPSCI 101 or STATS 102 is recommended.
MATH 201
Main topics of this course include vectors and vector functions, the geometry of higher dimensional Euclidean spaces, partial derivatives, multiple integrals, line integrals, vector fields, Green’s Theorem, Stokes’ Theorem and the Divergence Theorem.
MATH 202
Systems of linear equations and elementary row operations, Euclidean n-space and subspaces, linear transformations and matrix representations, Gram-Schmidt orthogonalization process, determinants, eigenvectors and eigenvalues; applications.
MATH 205
The fundamental concepts and tools of calculus, probability, and linear algebra are essential to modern sciences, from the theories of physics and chemistry that have long been tightly coupled to mathematical ideas, to the collection and analysis of data on complex biological systems. Given the emerging technologies for collecting and sharing large data sets, some familiarity with computational and statistical methods is now also essential for modeling biological and physical systems and interpreting experimental results. This course is an introduction to probability and statistics with an emphasis on concepts relevant for the analysis of complex data sets. It includes an introduction to the fundamental concepts of matrices, eigenvectors, and eigenvalues.
MATH 304
This course covers Gaussian elimination, LU factorization, Cholesky decomposition, QR decomposition, Newton-Raphson method, binary search, convex function, convex set, gradient method, Newton method, Lagrange dual, KKT condition, interior point method, conjugate gradient method, random walk, and stochastic optimization. Students are not allowed to take both MATH 302 and MATH 304 because of the content overlap. Students who are planning to major in Applied Math and Computational Sciences should take MATH 302 instead, and those who have taken MATH 304 may not major in Applied Math and Computational Sciences.
MATH 305
This course covers pseudo inverse, inner product, vector spaces and subspaces, orthogonality, linear transformations and operators, projections, matrix factorization, and singular value decomposition. COMPSCI 201 or COMPSCI 101 or STATS 102 is recommended.
This course covers sorting, order statistics, binary search, dynamic programming, greedy algorithms, graph algorithms, minimum spanning trees, shortest paths, SQL, file organization, hashing, sorting, query, schema, transaction management, concurrency control, crash recovery, distributed database, and database as a service.


Courses listed in the table below are recommended electives for the major and the course list will be updated periodically. Students can also select other courses in different disciplines or divisions as electives.

SOSC 320
This course focuses on how to extend the statistical techniques learned in STATS 101 and how to apply already learned techniques to real-world social science problems. You will learn (a) how to select an appropriate model for a given dataset, (b) how to interpret the diagnostic information from a statistical technique, and (c) which types of social science problems are usually addressed with which kinds of statistical model. Covered material includes the standard linear model, generalized linear models, bootstrap methods, and additional linear models that rely on specific assumptions about the underlying data. Data in the World uses R to apply models to real-world datasets, including for a final project that can serve as a basis for a signature work.
This course covers uninformed search, informed search, constraint satisfaction, classical planning, neural network, deep learning, hidden Markov model, Bayesian network, Markov decision process, reinforcement learning, active learning and game theory.
This course covers Bayesian inference, prior and posterior distributions, multi-level models, model checking and selection, and stochastic simulation by Markov Chain Monte Carlo.
This course covers cloud infrastructures, virtualization, distributed file system, software defined networks and storage, cloud storage, and programming models such as MapReduce and Spark.
This course covers image formation and representation, camera geometry and calibration, multi-view geometry, stereo, 3D reconstruction from images, motion analysis, image segmentation, and object recognition.
This course covers neural network, deep belief network, Boltzmann machine, convolutional neural network, recurrent neural network, and deep learning applications for speech, image, video, etc.
This course introduces the logical structure of digital media and explores computational media manipulation. The course uses the Python programming language to explore media manipulation and transformation. Topics include spatial and temporal resolution, color, texture, filtering, compression and feature detection.
ECON 211
This course explores the interdisciplinary conversation between economics and artificial intelligence (AI). Through experiential learning, it shows how the two disciplines advance each other through an explainable AI approach: economics makes AI more explainable by clarifying causal relationships, and AI empowers economic applications by increasing efficiency. Advanced research in Microeconomics, Macroeconomics, and Behavioral and Experimental Economics is covered through both a general literature review and a case study. The course concludes with a capstone project in which students collaboratively produce academic research and automated products, working in teams that combine the roles of Economist, Data Scientist, and Data Engineer.
As an introductory course in data science, this course shows students both the big picture of data science and the essential skills of loading, cleaning, manipulating, visualizing, analyzing and interpreting data, with hands-on programming experience. Not open to students who have credit for COMPSCI 101.
This course covers Bayesian network, Markov random field, Gaussian graphical model, message passing, generalized linear model, expectation-maximization, factor analysis, state space model, conditional random field, variational inference, approximate inference, Dirichlet process, kernel graphical model and spectral algorithm.
This course covers Boolean retrieval, dictionary, index, vector space model, score, query, XML, language model, text classification, clustering, and web search.
Topics to be covered: software reliability growth models, software failure data analytics, classical software fault tolerance techniques based on design diversity, novel software fault tolerance techniques based on environmental diversity, classification of software faults, software aging and rejuvenation, and software safety, security and survivability. Statistical methods used in this context, methods of predicting software availability during operation, prediction of time to failure and optimal times to rejuvenate will be discussed. Practical application of these ideas will also be presented via case studies of SDN open source software ONOS and ODL, NASA Satellite on-board software, Apache Webserver and Android operating system.
This course covers speech production and perception, feature extraction, template-based recognition, hidden Markov modeling, language model, sub-word units, robust recognition and applications.