Big Data Preprocessing
Undergraduate Level
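Syllabus
- Course Information (Syllabus)
Slides
- Chapter 1: Introduction (Slides)
- Chapter 2: Missing Data and Its Handling (Slides/Code)
- Chapter 3: Data Error Correction and Data Format Handling (Slides/Code)
- Chapter 4: Data Discretization (Slides/Code)
- Chapter 5: Handling Abnormally Distributed Data I: Low-Frequency Categorical Data, Highly Skewed Data, Outliers (Slides/Code)
- Chapter 6: Handling Abnormally Distributed Data II: Imbalanced Data (Slides/Code)
- Chapter 7: Data Scaling (Slides/Code)
- Chapter 8: Data Reduction (Slides/Code)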
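Assignments
Datasets
- Data
- The archive contains the datasets used in the lectures: insurance company claims data, used car data, and credit card fraud data.
- The Boston housing dataset is built into scikit-learn (versions below 1.2.0 only) and can be loaded with the following code.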
from sklearn.datasets import load_boston
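Because `load_boston` is only present in scikit-learn versions below 1.2.0, it can help to check the installed version before importing it. The following is a minimal sketch, not part of the course material: the helper names `parse_version`, `has_load_boston`, and `load_boston_if_available` are hypothetical and introduced here only for illustration.

```python
import re
from importlib import metadata


def parse_version(version):
    """Return (major, minor) from a version string such as '1.1.3' or '1.2rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_load_boston(version):
    """True when the given scikit-learn version is below 1.2 (hypothetical helper)."""
    return parse_version(version) < (1, 2)


def load_boston_if_available():
    """Return the Boston housing data, or None when it cannot be loaded here."""
    try:
        installed = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None  # scikit-learn is not installed
    if not has_load_boston(installed):
        return None  # load_boston was removed in scikit-learn 1.2
    from sklearn.datasets import load_boston
    return load_boston()
```

Calling `load_boston_if_available()` returns `None` on an environment with scikit-learn 1.2 or later, in which case an older version needs to be installed to follow the course code as written.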
Data Mining and Machine Learning
Graduate Level
- Course Description:
This course introduces modern techniques for the statistical analysis of complex and massive data. It covers most of the common machine learning models in supervised and unsupervised learning. Topics include linear regression, classification, model selection, regularization, basis expansion, kernel smoothing, random forests, and ensemble learning. Applications are discussed alongside computation and theoretical foundations.
- Prerequisites:
- Linear Algebra
- Probability
- Basic Knowledge of Statistical Learning (Reference: An Introduction to Statistical Learning. (Download PDF))
Textbook
- Hastie, T., Tibshirani, R. and Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
- The textbook is available online. (Download PDF)
Slides
- Lecture 1: Introduction to Statistical Learning and Machine Learning, Supervised Learning, Unsupervised Learning. (Download Slides)
- Chapter 1
- Chapter 2: 2.1 – 2.6
- Lecture 2: Linear Regression, Model Assessment and Selection. (Download Slides)
- Chapter 3: 3.1 – 3.3
- Chapter 7: 7.1 – 7.5, 7.7
- Lecture 3: Linear Regression with Derived Input Directions, Shrinkage Methods (Regularization), Cross-Validation, Bootstrap. (Download Slides)
- Chapter 3: 3.4 – 3.5
- Chapter 7: 7.10, 7.11
- Lecture 4: Regularization (Lasso, Ridge), Logistic Regression. (Download Slides)
- Chapter 3: 3.4, 3.6, 3.8
- Chapter 4: 4.4
- Lecture 5: Kernel Smoothing Methods. (Download Slides)
- Chapter 6: 6.1 – 6.4, 6.6, 6.8
- Lecture 6: Basis Expansion, Regularization. (Download Slides)
- Chapter 5: 5.1, 5.2, 5.4, 5.5, 5.7
- Recommended reading: 5.8 and 5.9, which are important in functional data analysis and signal processing.
- Lecture 7: Bagging, Additive Models, Trees. (Download Slides)
- Chapter 8: 8.7
- Chapter 9: 9.1, 9.2
- Lecture 8: Nearest-Neighbor, Random Forests. (Download Slides)
- Chapter 13: 13.4
- Chapter 15: 15.1 – 15.4
- Lecture 9: Boosting. (Download Slides)
- Chapter 10: 10.1 – 10.5, 10.10, 10.12
- Lecture 10: Boosting Trees. (Download Slides)
- Chapter 10: 10.9, 10.11
- Lecture 11: Introduction to Ensemble Learning, High-Dimensional Problems. (Download Slides)
- Chapter 16: 16.1 – 16.2
- Chapter 18: 18.1, 18.7
- Lecture 12: Neural Networks.
- Chapter 11
Other Materials
- Lecture Note. Download LaTeX Template. Compile with XeLaTeX.