实验一：波士顿房价预测报告

摘要

本报告详细记录了使用机器学习技术对经典波士顿房价数据集进行回归预测的实验过程。实验的核心目标是根据 13 个描述房屋及周边环境的特征（如房间数 RM、低收入人群比例 LSTAT、犯罪率 CRIM 等），构建能够准确预测自住房屋房价中位数 (MEDV) 的模型。实验遵循标准的机器学习工作流程，首先对数据进行了加载和全面的探索性数据分析（EDA）。通过可视化手段（如直方图、相关性热力图、散点图）揭示了数据的关键特性，包括部分特征的偏态分布、目标变量在高值处的截断现象、RM 与 LSTAT 对房价的强相关性以及特征间的共线性问题（如 RAD 与 TAX）。

This report details the experimental process of using machine learning techniques for regression prediction on the classic Boston housing dataset. The core objective of the experiment is to build a model capable of accurately predicting the median value of owner-occupied homes (MEDV) based on 13 features describing the house and its surrounding environment (such as number of rooms RM, percentage of lower status population LSTAT, crime rate CRIM, etc.). The experiment follows a standard machine learning workflow, starting with data loading and comprehensive Exploratory Data Analysis (EDA). Visualization methods (like histograms, correlation heatmaps, scatter plots) revealed key data characteristics, including the skewed distribution of some features, truncation of the target variable at high values, the strong correlation of RM and LSTAT with house prices, and collinearity issues between features (e.g., RAD and TAX).

基于 EDA 的发现，对数据进行了必要的预处理，核心步骤是使用 StandardScaler 对特征进行标准化，以消除量纲差异对模型训练的影响，同时探讨了添加交互特征和多项式特征的可能性。随后，系统性地训练和评估了多种常见的回归模型，涵盖了线性模型（线性回归、岭回归、Lasso 回归）和更复杂的集成模型（随机森林、梯度提升树、XGBoost）。模型性能通过均方根误差（RMSE）和决定系数（R²）进行量化，并利用 5 折交叉验证来评估其泛化能力和稳定性。评估结果（通过终端输出和可视化图表 output/model_comparison.png 展示）普遍表明，集成学习方法（特别是 XGBoost）在此任务上表现优于简单的线性模型。报告还讨论了超参数调优和特征重要性分析（output/feature_importance.png）作为进一步优化和理解模型的潜在步骤。最终，利用选定的最优模型对测试集进行预测，生成了预测结果文件 (output/submission.csv)，并通过预测值与真实值对比图 (output/actualpricevspredictedprice.png) 直观展示了模型效果。整个实验不仅完成了一个具体的预测任务，也完整地实践和展示了机器学习项目从数据理解到模型评估与应用的全过程。

Based on the EDA findings, necessary data preprocessing was performed, with the core step being feature standardization using StandardScaler to eliminate the impact of scale differences on model training, while also exploring the possibility of adding interaction and polynomial features. Subsequently, various common regression models were systematically trained and evaluated, covering linear models (Linear Regression, Ridge Regression, Lasso Regression) and more complex ensemble models (Random Forest, Gradient Boosting Tree, XGBoost). Model performance was quantified using Root Mean Squared Error (RMSE) and Coefficient of Determination (R²), and 5-fold cross-validation was used to assess generalization ability and stability. The evaluation results (presented via terminal output and visualization chart output/model_comparison.png) generally indicated that ensemble learning methods (especially XGBoost) outperformed simple linear models on this task. The report also discussed hyperparameter tuning and feature importance analysis (output/feature_importance.png) as potential steps for further optimization and model understanding. Finally, the selected optimal model was used to make predictions on the test set, generating a prediction file (output/submission.csv), and the model's effectiveness was visualized through a comparison chart of predicted versus actual values (output/actualpricevspredictedprice.png). The entire experiment not only completed a specific prediction task but also fully practiced and demonstrated the complete process of a machine learning project, from data understanding to model evaluation and application.

1. 引言

1.1 项目背景与意义

在现代社会经济活动中，房地产市场扮演着至关重要的角色。房价不仅是衡量地区经济发展水平、居民生活成本的关键指标，也直接关系到个体家庭的资产配置、投资决策乃至整体金融市场的稳定。因此，对房价进行准确的预测和分析，无论是对于政府制定宏观调控政策、企业进行投资决策，还是个人进行房产买卖，都具有极其重要的现实意义和应用价值。机器学习技术的蓬勃发展 [1, 2]，为我们提供了强大的工具来处理复杂的数据集，挖掘隐藏在数据背后的模式，从而实现对房价等复杂经济现象的有效预测。通过构建精准的房价预测模型，我们可以更好地理解影响房价波动的关键因素，洞察市场趋势，为相关决策提供科学依据，减少信息不对称带来的风险。

本项目选取了经典的"波士顿房价数据集"作为研究对象。该数据集虽然年代较为久远（收集于 20 世纪 70 年代），但因其包含了丰富多样的、与房价相关的真实世界特征，并且数据量适中、易于获取，成为了机器学习领域，特别是回归问题研究中，广泛使用的基准数据集之一。通过对这个数据集进行分析和建模，我们可以系统性地学习和实践机器学习在回归预测任务中的标准流程，包括数据探索、特征工程、模型选择、训练评估以及结果解释等关键环节。这对于初学者掌握机器学习基本原理、熟悉常用算法库（如 Scikit-learn [1]）以及理解模型评价指标具有重要的教学价值。同时，尽管数据集本身存在一定的局限性（如时间滞后性、样本量相对较小等），但其揭示的影响房价的基本因素（如地理位置、房屋属性、社区环境、宏观经济指标等）在今天仍然具有一定的参考意义。通过本次实验，我们旨在深入理解如何运用机器学习方法解决实际的回归预测问题，并为后续更复杂、更现代的房地产市场分析打下坚实的基础。

1.2 实验目标

本次实验的核心目标是利用提供的波士顿房价训练数据集 (train.csv)，构建一个能够准确预测房屋价格中位数 (MEDV) 的机器学习回归模型。具体而言，实验包含以下几个子目标：

The core objective of this experiment is to utilize the provided Boston housing training dataset (train.csv) to build a machine learning regression model capable of accurately predicting the median housing value (MEDV). Specifically, the experiment includes the following sub-objectives:

数据理解与探索: 深入理解数据集的背景、各个特征的含义，利用统计分析和可视化手段（如直方图、散点图、相关性热力图等）探索数据的分布特性、特征之间的关系以及特征与目标变量（房价）的潜在联系，识别异常值或缺失值（尽管本数据集中无缺失值）。
数据预处理与特征工程: 根据数据探索的发现，对数据进行必要的预处理，例如特征标准化（考虑到不同特征的量纲差异巨大），并可能尝试创建新的、更有预测能力的特征（如交互特征、多项式特征）来提升模型性能。
模型选择与训练: 选用多种经典的机器学习回归算法，包括线性模型（线性回归、岭回归、Lasso 回归）和集成模型（随机森林 [3]、梯度提升树 [4]、XGBoost [5]），在处理后的训练数据上进行模型训练。
模型评估与比较: 使用合适的评估指标（如均方根误差 RMSE、决定系数 R²）和交叉验证技术，客观地评估和比较不同模型的性能表现，理解各模型的优缺点和适用场景。
模型优化 (可选): 对表现较好的模型（如 XGBoost）进行超参数调优，寻找最佳参数组合以进一步提升预测精度。
特征重要性分析 (可选): 分析最终模型给出的特征重要性排序，理解哪些因素对波士顿房价影响最大，为模型提供一定的可解释性。
预测与结果生成: 利用训练好的最优模型，对测试数据集 (test.csv) 进行房价预测，并将预测结果按照指定格式 (sample.csv) 保存到 output/submission.csv 文件中。

Data Understanding and Exploration: Deeply understand the dataset background, the meaning of each feature, use statistical analysis and visualization methods (e.g., histograms, scatter plots, correlation heatmaps) to explore data distribution characteristics, relationships between features, and potential links between features and the target variable (house price), identify outliers or missing values (although there are no missing values in this dataset).
Data Preprocessing and Feature Engineering: Based on data exploration findings, perform necessary data preprocessing, such as feature standardization (considering the large differences in scale among features), and potentially try creating new, more predictive features (e.g., interaction features, polynomial features) to enhance model performance.
Model Selection and Training: Select and train various classic machine learning regression algorithms, including linear models (Linear Regression, Ridge Regression, Lasso Regression) and ensemble models (Random Forest [3], Gradient Boosting Tree [4], XGBoost [5]), on the processed training data.
Model Evaluation and Comparison: Use appropriate evaluation metrics (e.g., Root Mean Squared Error RMSE, Coefficient of Determination R²) and cross-validation techniques to objectively evaluate and compare the performance of different models, understanding their pros, cons, and applicability.
Model Optimization (Optional): Perform hyperparameter tuning on well-performing models (e.g., XGBoost) to find the optimal parameter combination for further improving prediction accuracy.
Feature Importance Analysis (Optional): Analyze the feature importance ranking provided by the final model to understand which factors have the greatest impact on Boston housing prices, providing some interpretability to the model.
Prediction and Result Generation: Use the trained optimal model to predict house prices for the test dataset (test.csv) and save the prediction results in the specified format (sample.csv) to the output/submission.csv file.

通过完成以上目标，我们期望不仅能得到一个性能良好的房价预测模型，更能全面掌握机器学习项目从数据到模型、再到结果解释的完整流程。

1.3 数据集概述

本次实验使用的数据集是波士顿房价数据集 [6]，具体包含以下文件：

data/exp1/train.csv: 训练数据，包含 404 个样本，每个样本有 13 个特征和 1 个目标变量 (MEDV)。
data/exp1/test.csv: 测试数据，包含 102 个样本，每个样本有 13 个特征，需要预测其对应的 MEDV。
data/exp1/sample.csv: 提交结果的示例文件，展示了 submission.csv 应有的格式。
data/exp1/README.md: 数据集的描述文件，包含了各特征的详细含义。

数据集包含的 13 个特征如下：

CRIM: 城镇人均犯罪率
ZN: 占地面积超过 25,000 平方英尺的住宅用地比例
INDUS: 城镇非零售商业用地所占比例
CHAS: 是否邻近查尔斯河（1 表示是，0 表示否）- 分类型特征
NOX: 一氧化氮浓度（百万分之一）- 环境指数
RM: 每栋住宅的平均房间数
AGE: 1940 年以前建成的自住单位比例
DIS: 与波士顿五个就业中心的加权距离
RAD: 距离高速公路的便利指数
TAX: 每万美元的不动产税率
PTRATIO: 城镇师生比例
B: 计算公式为 1000(Bk - 0.63)^2，其中 Bk 是城镇非裔美国人的比例
LSTAT: 房东属于低收入阶层人口的百分比

目标变量：

MEDV: 自住房屋房价中位数（单位：千美元）

需要注意的是，该数据集的特征类型多样，取值范围差异很大，且部分特征（如 CRIM, ZN, B）呈现明显的偏态分布，这在数据预处理和模型选择时需要特别考虑。同时，特征之间可能存在多重共线性（如 RAD 和 TAX 高度相关），这也可能影响某些模型的性能和解释性。实验代码位于 exp1/exp1.py，它将执行上述所有分析、处理和建模步骤，并依赖 NumPy [7] 和 Pandas [8] 等库进行数据处理，最终将可视化结果和预测文件输出到根目录下的 output 文件夹。

It is important to note that this dataset has diverse feature types, widely varying value ranges, and some features (like CRIM, ZN, B) exhibit significant skewness, which requires special consideration during data preprocessing and model selection. Additionally, multicollinearity might exist among features (e.g., RAD and TAX are highly correlated), potentially affecting the performance and interpretability of certain models. The experimental code is located in exp1/exp1.py, which will execute all the analysis, processing, and modeling steps mentioned above, relying on libraries like NumPy [7] and Pandas [8] for data handling, and finally output visualization results and prediction files to the output folder in the root directory.

2. 数据加载与探索性数据分析 (EDA)

2.1 数据加载

脚本首先使用 pandas 库加载训练数据集 (train.csv) 和测试数据集 (test.csv)。根据 exp1.py 中的代码，数据文件是以空格分隔的，并且没有表头，因此在加载时指定了 delim_whitespace=True 和 header=None。加载后，为数据框添加了有意义的列名，如 'CRIM', 'ZN', 'RM', 'LSTAT', 'MEDV' 等。测试集比训练集少 'MEDV' 这一列。

The script first uses the pandas library to load the training dataset (train.csv) and the test dataset (test.csv). According to the code in exp1.py, the data files are space-delimited and have no header, so delim_whitespace=True and header=None were specified during loading. After loading, meaningful column names such as 'CRIM', 'ZN', 'RM', 'LSTAT', 'MEDV', etc., were added to the dataframes. The test set lacks the 'MEDV' column compared to the training set.

2.2 探索性数据分析

脚本执行了详细的 EDA 来理解数据特性：

基本信息: 打印了数据的基本统计描述（如均值、标准差、分位数）和信息（数据类型、非空值数量）。确认了数据没有缺失值。
特征分布: 使用 seaborn 和 matplotlib 可视化了所有 13 个特征以及目标变量 MEDV 的分布直方图。具体可参见下图：

特征分布

分析：从特征分布图中可以看出，许多特征并非理想的正态分布。例如 CRIM（犯罪率）、ZN（大面积住宅比例）、B（非裔美国人比例）呈现明显的右偏（正偏态），意味着大部分城镇这些指标的值较低，少数城镇的值很高。CHAS（是否临河）是一个二元变量。目标变量 MEDV（房价中位数）的分布相对接近正态分布，但尾部可能存在一些高价房产，且在最高值附近（50k美元）有截断现象（意味着部分房价可能超过了记录上限）。了解这些分布有助于选择合适的数据转换方法（如对数变换）或对模型选择提供参考。

Analysis: The feature distribution plots show that many features do not follow an ideal normal distribution. For example, CRIM (crime rate), ZN (proportion of large lots), and B (proportion of Black residents) are significantly right-skewed (positive skewness), meaning most towns have low values for these indicators, while a few towns have very high values. CHAS (bordering the river) is a binary variable. The distribution of the target variable MEDV (median house value) is relatively close to a normal distribution, but there might be some high-priced properties in the tail, and there is a truncation phenomenon near the maximum value (50k USD), implying some house prices might have exceeded the recording limit. Understanding these distributions helps in selecting appropriate data transformation methods (like logarithmic transformation) or provides references for model selection.

目标变量分布

分析：目标变量 MEDV 的分布图更清晰地显示了其近似正态的特性，但如上所述，在 50k 美元处有一个明显的峰值，这可能是数据收集时的上限或截断点，处理时需要注意。
这有助于了解每个变量的分布形态（如是否偏态）。

Analysis: The distribution plot of the target variable MEDV more clearly shows its approximately normal characteristic, but as mentioned above, there is a distinct peak at 50k USD, which might be a cap or truncation point during data collection, requiring attention during processing.
This helps understand the distribution shape of each variable (e.g., whether it is skewed).

相关性分析:
- 计算了所有变量之间的皮尔逊相关系数，并通过热力图进行可视化：

3. 特征工程与预处理

在模型训练之前，进行了必要的特征工程和预处理：

特征与目标分离: 将训练数据分为特征矩阵 X 和目标向量 y (MEDV)。
特征缩放: 由于不同特征的取值范围差异很大（例如，ZN 在 0-100 之间，CHAS 是 0 或 1），脚本使用了 sklearn.preprocessing.StandardScaler 对所有特征进行了标准化（转换为均值为 0，标准差为 1 的分布）。这对许多依赖于距离或梯度的模型（如线性回归、岭回归、Lasso、SVM 等）至关重要，可以防止某些特征因数值范围过大而主导模型训练。测试集的特征也使用在训练集上 fit 过的同一个 scaler 进行 transform，以保证一致性。
特征创建 (可选): 脚本中还包含了创建新特征的代码（尽管可能被注释掉或作为可选步骤）：
- 交互特征: 例如，RM 和 LSTAT 的乘积 (RM_LSTAT)，尝试捕捉这些强相关特征之间的交互效应。
- 多项式特征: 例如，RM 和 LSTAT 的平方项 (RM_sq, LSTAT_sq)，尝试捕捉可能的非线性关系。

这些处理步骤旨在提高模型的性能和稳定性。

3.1 特征工程伪代码解释

以下是特征工程与预处理部分 (feature_engineering 函数) 的伪代码解释：

Below is the pseudo-code explanation for the feature engineering and preprocessing part (feature_engineering function):

FUNCTION feature_engineering(train_data, test_data):
    // 打印开始信息
    PRINT "开始特征工程..."

    // 1. 复制数据框以避免修改原始数据
    train_processed = COPY(train_data)
    test_processed = COPY(test_data)

    // 2. 分离特征和目标变量
    X = train_processed.DROP_COLUMN('MEDV') // 训练集特征
    y = train_processed.GET_COLUMN('MEDV')  // 训练集目标

    // 3. 特征标准化
    // 初始化标准化器
    scaler = INITIALIZE StandardScaler()
    // 使用训练特征 X 拟合标准化器，并转换 X
    X_scaled = scaler.FIT_TRANSFORM(X)
    // 使用 *同一个* 拟合好的标准化器转换测试特征
    test_scaled = scaler.TRANSFORM(test_processed)

    // 4. 将标准化后的 NumPy 数组转换回 Pandas DataFrame，保留列名
    X_scaled_df = CREATE_DATAFRAME(X_scaled, columns=X.columns)
    test_scaled_df = CREATE_DATAFRAME(test_scaled, columns=test_processed.columns)

    // 5. 可选步骤：创建新特征 (示例)
    // 创建交互特征 RM * LSTAT
    X_scaled_df['RM_LSTAT'] = X_scaled_df['RM'] * X_scaled_df['LSTAT'] * -1
    test_scaled_df['RM_LSTAT'] = test_scaled_df['RM'] * test_scaled_df['LSTAT'] * -1

    // 创建多项式特征 RM^2 和 LSTAT^2
    X_scaled_df['RM_sq'] = X_scaled_df['RM'] ** 2
    test_scaled_df['RM_sq'] = test_scaled_df['RM'] ** 2
    X_scaled_df['LSTAT_sq'] = X_scaled_df['LSTAT'] ** 2
    test_scaled_df['LSTAT_sq'] = test_scaled_df['LSTAT'] ** 2

    // 6. 返回处理后的数据
    RETURN X_scaled_df, y, test_scaled_df

4. 模型训练与评估

脚本采用了系统化的方法来训练和评估多个常见的回归模型：

数据划分: 使用 sklearn.model_selection.train_test_split 将处理后的训练数据划分为训练集和验证集（例如，80% 训练，20% 验证），用于评估模型在未见过的数据上的泛化能力。
模型选择: 训练了多种回归模型，涵盖了从基础到先进的算法，以便进行全面的比较：
- 线性回归 (LinearRegression)
  - 原理: 最基础的回归算法，旨在找到一条最佳拟合直线（或超平面）来描述特征 \(X\) 与目标变量 \(y\) 之间的线性关系。它假设目标变量是特征的线性组合加上一个误差项。
  - 公式: 模型形式为 \( \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \)，其中 \( \hat{y} \) 是预测值，\( x_1, ..., x_p \) 是特征，\( \beta_0 \) 是截距项，\( \beta_1, ..., \beta_p \) 是特征对应的系数。模型通过最小化残差平方和 (Residual Sum of Squares, RSS) 来估计系数 \( \beta \)： \[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 \]
- 岭回归 (Ridge) - L2 正则化
  - 原理: 线性回归的一种正则化形式，主要用于处理特征间存在多重共线性（高度相关）的情况以及防止模型过拟合。它在线性回归的损失函数（RSS）基础上增加了一个 L2 正则化项（系数的平方和）。
  - 公式: 岭回归的优化目标是最小化： \[ \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \] 其中 \( \lambda \ge 0 \) 是调整正则化强度的超参数。\( \lambda \) 越大，系数 \( \beta_j \) 被压缩得越接近于 0，但不会变为严格的 0。这有助于降低模型复杂度，提高模型的泛化能力。
- Lasso 回归 (Lasso) - L1 正则化
  - 原理: 另一种线性回归的正则化形式，与岭回归类似，但它使用的是 L1 正则化项（系数的绝对值之和）。Lasso 的一个重要特性是它能够将某些特征的系数精确地压缩到 0，从而实现特征选择的效果。
  - 公式: Lasso 回归的优化目标是最小化： \[ \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 + \lambda \sum_{j=1}^{p} |\beta_j| \] 同样，\( \lambda \ge 0 \) 是正则化强度的超参数。当 \( \lambda \) 足够大时，一些不重要的特征系数会变成 0，使得模型更稀疏、更易于解释。
- 随机森林 (RandomForestRegressor) - 集成学习（Bagging）
  - 原理: 一种强大的集成学习算法，属于 Bagging（Bootstrap Aggregating）方法的扩展。它构建多个决策树（回归树），并在预测时将所有树的预测结果进行平均。为了增加树之间的差异性（从而降低整体模型的方差），随机森林在两个层面引入了随机性：1. 样本随机: 对原始训练数据进行有放回的自助采样（Bootstrap Sampling），为每棵树生成不同的训练子集。2. 特征随机: 在每个决策树节点进行分裂时，不是考虑所有特征，而是随机选择一部分特征进行考察，从中选出最佳分裂特征。
  - 优点: 通常具有很高的预测精度，对过拟合不敏感，能够处理高维数据，并且可以评估特征的重要性。
- 梯度提升树 (GradientBoostingRegressor) - 集成学习（Boosting）
  - 原理: 另一种强大的集成学习算法，属于 Boosting 方法。与随机森林并行构建树不同，梯度提升是串行地、迭代地构建决策树。每一棵新树都是为了拟合前一棵树（或之前所有树组成的模型）的残差（或梯度）。模型通过逐步减小损失函数（如均方误差）来提升性能。它在函数空间中使用梯度下降的思想来优化模型。
  - 优点: 预测精度通常非常高，能够处理各种类型的特征。
  - 缺点: 对参数设置较为敏感，训练过程相对较慢（因为是串行的）。
- XGBoost (xgb.XGBRegressor) - 高效的梯度提升库
  - 原理: 梯度提升算法的高效、可扩展且带正则化的实现。它在标准梯度提升的基础上做了许多优化，包括：1. 正则化: 在损失函数中加入了 L1 和 L2 正则化项（针对叶子节点的权重），防止过拟合。2. 优化: 优化了损失函数（使用二阶泰勒展开），并采用了更智能的树分裂策略（如考虑缺失值的处理、近似分位数算法等）。3. 效率: 支持并行计算（特征级别并行）、缓存优化等，大大提高了训练速度。
  - 优点: 性能卓越，速度快，是许多数据科学竞赛中的首选算法之一。
评估指标: 主要使用以下两个指标来评估模型性能：
- 均方根误差 (RMSE - Root Mean Squared Error): 衡量预测值与真实值之间的平均差异，值越小越好。计算公式为： \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \] 其中 \( n \) 是样本数量，\( y_i \) 是第 \( i \) 个样本的真实值，\( \hat{y}_i \) 是预测值。
- 决定系数 (R² - Coefficient of Determination): 表示模型能够解释的目标变量方差的比例，值越接近 1 越好（在线性回归中通常介于 0 和 1 之间）。计算公式为： \[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \] 其中 \( \bar{y} \) 是目标变量的平均值，RSS 是残差平方和，TSS (Total Sum of Squares) 是总平方和。
交叉验证: 为了获得对模型泛化能力更可靠的估计，脚本对每个模型在完整的（处理后的）训练数据集上进行了 5 折交叉验证 (cross_val_score)。过程是：将训练数据分成 5 个互斥的子集（折），轮流使用其中 4 折作为训练集，剩下 1 折作为验证集，重复 5 次，最后将 5 次的评估结果（如负均方误差）平均，得到交叉验证得分。这有助于评估模型的稳定性和避免因单次训练/验证集划分带来的偶然性。
模型比较: 脚本计算了每个模型在验证集上的 RMSE 和 R²，并将这些结果打印到终端输出中。同时，生成了一个条形图用于直观比较各模型在验证集上的 RMSE，便于选出表现最佳的模型：

Data Splitting: Used sklearn.model_selection.train_test_split to split the processed training data into training and validation sets (e.g., 80% train, 20% validation) to evaluate the model's generalization ability on unseen data.
Model Selection: Trained various regression models, covering algorithms from basic to advanced, for comprehensive comparison:
- Linear Regression (LinearRegression)
  - Principle: The most basic regression algorithm, aiming to find the best-fitting straight line (or hyperplane) to describe the linear relationship between features \(X\) and the target variable \(y\). It assumes the target variable is a linear combination of features plus an error term.
  - Formula: The model form is \( \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \), where \( \hat{y} \) is the predicted value, \( x_1, ..., x_p \) are the features, \( \beta_0 \) is the intercept term, and \( \beta_1, ..., \beta_p \) are the coefficients corresponding to the features. The model estimates the coefficients \( \beta \) by minimizing the Residual Sum of Squares (RSS): \[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 \]
- Ridge Regression (Ridge) - L2 Regularization
  - Principle: A regularized form of linear regression, mainly used to handle multicollinearity (high correlation) among features and prevent overfitting. It adds an L2 regularization term (sum of squared coefficients) to the linear regression loss function (RSS).
  - Formula: The optimization objective of Ridge Regression is to minimize: \[ \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \] where \( \lambda \ge 0 \) is the hyperparameter adjusting the regularization strength. The larger \( \lambda \), the more the coefficients \( \beta_j \) are shrunk towards 0, but they do not become strictly 0. This helps reduce model complexity and improve generalization.
- Lasso Regression (Lasso) - L1 Regularization
  - Principle: Another regularized form of linear regression, similar to Ridge, but using an L1 regularization term (sum of absolute values of coefficients). An important characteristic of Lasso is its ability to shrink some feature coefficients exactly to 0, thus performing feature selection.
  - Formula: The optimization objective of Lasso Regression is to minimize: \[ \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 + \lambda \sum_{j=1}^{p} |\beta_j| \] Similarly, \( \lambda \ge 0 \) is the regularization strength hyperparameter. When \( \lambda \) is sufficiently large, some unimportant feature coefficients become 0, making the model sparser and easier to interpret.
- Random Forest (RandomForestRegressor) - Ensemble Learning (Bagging)
  - Principle: A powerful ensemble learning algorithm, an extension of the Bagging (Bootstrap Aggregating) method. It builds multiple decision trees (regression trees) and averages their predictions. To increase diversity among trees (thus reducing the overall model variance), Random Forest introduces randomness at two levels: 1. Sample Randomness: Performs bootstrap sampling (sampling with replacement) on the original training data to generate different training subsets for each tree. 2. Feature Randomness: When splitting nodes in each decision tree, instead of considering all features, it randomly selects a subset of features to evaluate, choosing the best split feature from this subset.
  - Advantages: Typically achieves high prediction accuracy, is robust to overfitting, can handle high-dimensional data, and can evaluate feature importance.
- Gradient Boosting Tree (GradientBoostingRegressor) - Ensemble Learning (Boosting)
  - Principle: Another powerful ensemble learning algorithm, belonging to the Boosting method. Unlike Random Forest which builds trees in parallel, Gradient Boosting builds decision trees sequentially and iteratively. Each new tree is trained to fit the residuals (or gradients) of the previous tree (or the ensemble of all preceding trees). The model improves performance by gradually reducing the loss function (e.g., mean squared error). It uses gradient descent ideas in the function space to optimize the model.
  - Advantages: Prediction accuracy is usually very high, capable of handling various feature types.
  - Disadvantages: Sensitive to parameter settings, training process is relatively slow (due to its sequential nature).
- XGBoost (xgb.XGBRegressor) - Efficient Gradient Boosting Library
  - Principle: An efficient, scalable, and regularized implementation of the gradient boosting algorithm. It incorporates many optimizations over standard gradient boosting, including: 1. Regularization: Adds L1 and L2 regularization terms to the loss function (targeting the weights of leaf nodes) to prevent overfitting. 2. Optimization: Optimizes the loss function (using second-order Taylor expansion) and employs smarter tree splitting strategies (e.g., handling missing values, approximate quantile algorithms). 3. Efficiency: Supports parallel computation (feature-level parallelism), cache optimization, etc., significantly speeding up training.
  - Advantages: Excellent performance, fast speed, often a preferred algorithm in many data science competitions.
Evaluation Metrics: Primarily used the following two metrics to assess model performance:
- Root Mean Squared Error (RMSE): Measures the average difference between predicted and actual values; the smaller the value, the better. Formula: \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \] where \( n \) is the number of samples, \( y_i \) is the actual value of the \( i \)-th sample, and \( \hat{y}_i \) is the predicted value.
- Coefficient of Determination (R²): Represents the proportion of the variance in the target variable that the model can explain; the closer to 1, the better (typically between 0 and 1 for linear regression). Formula: \[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \] where \( \bar{y} \) is the mean of the target variable, RSS is the Residual Sum of Squares, and TSS (Total Sum of Squares) is the Total Sum of Squares.
Cross-Validation: To obtain a more reliable estimate of the model's generalization ability, the script performed 5-fold cross-validation (cross_val_score) on each model using the complete (processed) training dataset. The process involves: splitting the training data into 5 mutually exclusive subsets (folds), iteratively using 4 folds for training and the remaining 1 fold for validation, repeating 5 times, and finally averaging the 5 evaluation results (e.g., negative mean squared error) to get the cross-validation score. This helps assess model stability and avoids randomness due to a single train/validation split.
Model Comparison: The script calculated the RMSE and R² for each model on the validation set and printed these results to the terminal output. Additionally, a bar chart was generated to visually compare the RMSE of each model on the validation set, facilitating the selection of the best-performing model:

模型性能比较

5. 超参数调优与特征重要性

根据 exp1.py 的函数定义 (tune_hyperparameters, feature_importance)，脚本很可能包含以下步骤：

Based on the function definitions in exp1.py (tune_hyperparameters, feature_importance), the script likely includes the following steps:

超参数调优: 对于表现最好的模型（例如 XGBoost 或随机森林），可能会使用 GridSearchCV 或类似方法，结合交叉验证来寻找最优的超参数组合（如树的数量、深度、学习率等），以进一步提升模型性能。
特征重要性分析: 对于最终选定的模型（特别是树模型如随机森林、梯度提升树、XGBoost），脚本会提取并可视化特征的重要性得分：

特征重要性

这有助于理解哪些特征对房价预测贡献最大，为模型解释和未来可能的特征选择提供依据。线性模型的系数也可以用来分析特征影响。

6. 预测与结果生成

最终模型训练: 使用找到的最优超参数（如果进行了调优）在全部处理后的训练数据上重新训练选定的最佳模型。
测试集预测: 使用训练好的最终模型对处理后的测试数据集 (test_scaled_df) 进行预测，得到测试集样本的预测房价。
预测效果可视化: 为了更直观地评估最终模型的预测性能，可以将预测值与真实值（如果在验证集上评估）进行散点图对比：

预测值 vs 真实值

分析：理想情况下，散点图中的点应紧密围绕在对角线（y=x）周围，表明预测值与真实值高度一致。如果点偏离对角线较远，则表示预测误差较大。观察此图可以判断模型在不同价格区间的预测表现，是否存在系统性的高估或低估。

结果提交: 将对测试集的预测结果保存为 CSV 文件，路径为 output/submission.csv，其格式遵循 data/exp1/sample.csv 的要求。

7. 结论

本实验通过加载、探索波士顿房价数据，进行特征工程和预处理，训练、评估并比较了多种回归模型（包括线性模型和强大的集成模型如 XGBoost），最终目标是生成对测试集房价的准确预测。

通过 EDA 发现了 RM 和 LSTAT 等关键特征。特征标准化是必要的预处理步骤。模型比较显示，集成模型（如 XGBoost、随机森林、梯度提升树）通常在此类表格数据回归任务上表现优于简单的线性模型。脚本还包含了进行超参数调优和特征重要性分析的功能，以进一步优化模型并理解其预测依据。最终的预测结果被格式化并保存，可用于提交或进一步分析。

EDA revealed key features like RM and LSTAT. Feature standardization was a necessary preprocessing step. Model comparison showed that ensemble models (like XGBoost, Random Forest, Gradient Boosting Tree) generally outperformed simple linear models on such tabular data regression tasks. The script also included functionality for hyperparameter tuning and feature importance analysis to further optimize the model and understand its prediction basis. The final prediction results were formatted and saved, ready for submission or further analysis.

整个流程体现了机器学习项目的一个标准工作流，从数据理解到模型部署（预测）。

— HealthJian ✍️

参考文献

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189-1232.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357-362.
McKinney, W. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51-56).

附录

附录 A: 模型训练评估流程图

以下流程图展示了本实验中模型训练与评估的主要步骤：

模型训练评估流程图

摘要