Used-Car Transaction Price Prediction Challenge: The Full Workflow from Data Processing to Model Tuning

Approach to the challenge and detailed implementation steps

1. Data preprocessing

Data preprocessing is a key step in any machine learning project. It covers data cleaning, feature engineering, and data transformation.

# 1.1 Loading and initial inspection

```python
import pandas as pd

# Load the data
train_df = pd.read_csv('train.csv')
test_a_df = pd.read_csv('test_a.csv')
test_b_df = pd.read_csv('test_b.csv')

# Inspect the basics
train_df.info()  # info() prints its report directly and returns None
print(train_df.describe())
print(train_df.head())
```

# 1.2 Handling missing values

```python
# Count missing values per column
print(train_df.isnull().sum())

# Numeric features: fill with the median (the mean is an alternative).
# Assigning the result back avoids the chained `inplace=True` pattern,
# which triggers warnings in recent pandas versions.
train_df['power'] = train_df['power'].fillna(train_df['power'].median())
train_df['kilometer'] = train_df['kilometer'].fillna(train_df['kilometer'].median())

# Categorical features: fill with the mode
train_df['notrepaireddamage'] = train_df['notrepaireddamage'].fillna(
    train_df['notrepaireddamage'].mode()[0]
)

# Date feature: fill with the most frequent value
train_df['regdate'] = train_df['regdate'].fillna(train_df['regdate'].mode()[0])

# Remaining columns, including the anonymous variables:
# mode for object columns, median for everything else
for col in train_df.columns:
    if train_df[col].isnull().sum() > 0 and col not in ['name', 'model', 'brand', 'regioncode']:
        if train_df[col].dtype == 'object':
            train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
        else:
            train_df[col] = train_df[col].fillna(train_df[col].median())
```

# 1.3 Feature engineering

```python
from sklearn.preprocessing import LabelEncoder

# Extract year and month from the registration date (assumed to be in YYYYMMDD form)
train_df['regdate_year'] = train_df['regdate'].apply(lambda x: int(str(x)[:4]))
train_df['regdate_month'] = train_df['regdate'].apply(lambda x: int(str(x)[4:6]))

# Map the damage flag to integers; any value other than 'yes'/'no' becomes NaN
train_df['notrepaireddamage'] = train_df['notrepaireddamage'].map({'yes': 0, 'no': 1})

# Label-encode the remaining object columns. One fitted encoder is kept per column
# so that exactly the same mapping can be applied to the test sets later.
label_encoders = {}
for col in train_df.columns:
    if train_df[col].dtype == 'object' and col not in ['name', 'model', 'brand', 'regioncode']:
        encoder = LabelEncoder()
        train_df[col] = encoder.fit_transform(train_df[col])
        label_encoders[col] = encoder
```

2. Model selection and training

Choose suitable models, train them, and compare their performance on a held-out validation set. Cross-validation is used during tuning in step 3.

# 2.1 Splitting into training and validation sets

```python
from sklearn.model_selection import train_test_split

# Every column except the target is used as a feature.
# 'name', 'model', 'brand' and 'regioncode' were skipped by the encoding step,
# so they must already be numeric (or be dropped) before training.
X = train_df.drop(['price'], axis=1)
y = train_df['price']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
```

# 2.2 Choosing models

Several models are worth trying, such as linear regression, decision trees, random forests, and XGBoost.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Linear regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
linear_pred = linear_reg.predict(X_val)

# Random forest
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_val)

# XGBoost
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_reg.fit(X_train, y_train)
xgb_pred = xgb_reg.predict(X_val)
```

# 2.3 Evaluating the models

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error (RMSE). The square root is taken explicitly because
# the `squared=False` argument is deprecated in recent scikit-learn releases.
linear_rmse = np.sqrt(mean_squared_error(y_val, linear_pred))
rf_rmse = np.sqrt(mean_squared_error(y_val, rf_pred))
xgb_rmse = np.sqrt(mean_squared_error(y_val, xgb_pred))

print(f'Linear Regression RMSE: {linear_rmse}')
print(f'Random Forest RMSE: {rf_rmse}')
print(f'XGBoost RMSE: {xgb_rmse}')
```

3. Model tuning

Use grid search or random search to tune the hyperparameters.

```python
from sklearn.model_selection import GridSearchCV

# Parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# 5-fold cross-validated grid search
grid_search = GridSearchCV(
    estimator=xgb_reg,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Best parameter combination found
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

# Retrain with the best parameters; the dict must be unpacked into keyword arguments
best_xgb_reg = XGBRegressor(**best_params, random_state=42)
best_xgb_reg.fit(X_train, y_train)
best_xgb_pred = best_xgb_reg.predict(X_val)

# Evaluate the tuned model
best_xgb_rmse = np.sqrt(mean_squared_error(y_val, best_xgb_pred))
print(f'Best XGBoost RMSE: {best_xgb_rmse}')
```

4. Prediction and submission

Use the best model to predict on the test sets and generate the submission files.

```python
def prepare_test(df):
    """Apply the same feature steps that were used on the training set."""
    df = df.copy()
    df['regdate_year'] = df['regdate'].apply(lambda x: int(str(x)[:4]))
    df['regdate_month'] = df['regdate'].apply(lambda x: int(str(x)[4:6]))
    df['notrepaireddamage'] = df['notrepaireddamage'].map({'yes': 0, 'no': 1})
    # Reuse the encoders fitted on the training set.
    # transform() raises an error if a test column contains an unseen label.
    for col, encoder in label_encoders.items():
        df[col] = encoder.transform(df[col])
    # Use the same columns, in the same order, as the training features
    return df[X.columns]

# Predict on test sets A and B and write one submission file for each
for name, test_df in [('a', test_a_df), ('b', test_b_df)]:
    test_pred = best_xgb_reg.predict(prepare_test(test_df))
    submission = pd.DataFrame({
        'SaleID': test_df['SaleID'],
        'price': test_pred
    })
    submission.to_csv(f'submission_{name}.csv', index=False)
```

Those are the detailed steps for predicting used-car transaction prices: data preprocessing, model selection and training, model tuning, and the final prediction and submission. Hopefully they help you complete the competition task.

###### [AI写代码神器 | answer: 1736 points | 2024-11-17 18:50:36]
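The missing-value step in 1.2 computes its fill values on the training set only, and the test sets are never filled. One way to keep the two consistent is to learn the fill values once from the training data and apply them everywhere. The following is a minimal sketch under that assumption. The tiny frames `train_demo` and `test_demo` and the helper `fit_fill_values` are hypothetical and stand in for the competition files.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the training and test data
train_demo = pd.DataFrame({
    'power': [75.0, None, 150.0, 110.0],
    'notrepaireddamage': ['no', 'no', None, 'yes'],
})
test_demo = pd.DataFrame({
    'power': [None, 90.0],
    'notrepaireddamage': [None, 'yes'],
})


def fit_fill_values(df):
    """Return one fill value per column: median for numeric columns, mode otherwise."""
    fill_values = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill_values[col] = df[col].median()
        else:
            fill_values[col] = df[col].mode()[0]
    return fill_values


# Learn the fill values from the training data only, then apply them to both sets
fill_values = fit_fill_values(train_demo)
train_filled = train_demo.fillna(fill_values)
test_filled = test_demo.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Computing the statistics on the training set alone keeps information from the test set out of the preprocessing, and guarantees both sets are filled with the same values.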
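Step 2 mentions cross-validation, while the comparison in 2.3 relies on a single hold-out split. A k-fold estimate is usually less sensitive to how that one split happens to fall. The sketch below is illustrative only: it assumes synthetic data from `make_regression` in place of the prepared competition features, and the names `X_demo`, `y_demo`, and `fold_rmse` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and price target (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so the MSE comes back negated
neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='neg_mean_squared_error')
fold_rmse = np.sqrt(-neg_mse)

print(f'RMSE per fold: {fold_rmse.round(2)}')
print(f'Mean RMSE: {fold_rmse.mean():.2f} (std {fold_rmse.std():.2f})')
```

The spread across folds gives a rough sense of how much of the difference between two models is noise.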
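Step 3 names random search as an alternative to grid search but only shows the grid version. The sketch below uses `RandomizedSearchCV`, which samples a fixed number of parameter combinations instead of trying all of them. To keep it self-contained, it swaps in scikit-learn's `GradientBoostingRegressor` for `XGBRegressor` and runs on synthetic data. Both substitutions, along with the candidate values, are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Candidate values to sample from (illustrative, not tuned for the competition)
param_distributions = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [2, 3, 5],
}

# Evaluate only 5 randomly sampled combinations out of the 36 possible ones
random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
random_search.fit(X_demo, y_demo)

# Square root of the mean cross-validated MSE of the best combination
best_rmse = (-random_search.best_score_) ** 0.5
print(f'Best Parameters: {random_search.best_params_}')
print(f'Best CV RMSE: {best_rmse:.2f}')
```

With the 27-combination grid from step 3 and `cv=5`, grid search fits 135 models plus the final refit, whereas `n_iter` caps the cost of random search directly.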
