
House Prices - Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting.
Stage: Submission Deadline
Prize: Knowledge
Summary

Overview

Start here if...

You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

 

Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

 

Practice Skills

  • Creative feature engineering
  • Advanced regression techniques like random forest and gradient boosting
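The two techniques named above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data (standing in for the Ames features), not a competitive solution:

```python
# Minimal sketch of the two practice skills: a random forest and
# gradient boosting, fit on synthetic regression data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # stand-in for the 79 explanatory variables
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = model.score(X_te, y_te)  # R^2 on held-out data
print(scores)
```

On real competition data you would swap the synthetic arrays for the engineered features from train.csv.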

 

Acknowledgments

The Ames Housing dataset was compiled by Dean DeCock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

Photo by Tom Thain on Unsplash.


Guidelines

RULES

One account per participant

You cannot sign up for Kaggle from multiple accounts, and therefore you cannot submit from multiple accounts.

No private sharing outside teams

Privately sharing code or data outside of teams is not permitted. Sharing code is fine if it is made available to all participants, for example on the forums.

Team Mergers

Team mergers are allowed and can be performed by the team leader. In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date. The maximum allowed is the number of submissions per day multiplied by the number of days the competition has been running.
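As a worked example of that cap, assuming this competition's limit of 5 submissions per day and a hypothetical 30 days since the start date:

```python
# Hypothetical worked example of the merger submission cap described above.
# 5/day is this competition's stated limit; 30 days elapsed is an assumption.
submissions_per_day = 5
days_running = 30
max_combined_submissions = submissions_per_day * days_running
print(max_combined_submissions)  # the merged team may have at most this many
```

So a merger on day 30 is allowed only if the two teams together have made 150 or fewer submissions.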

Team Limits

There is no maximum team size.

Submission Limits

You may submit a maximum of 5 entries per day.

You may select up to 2 final submissions for judging.

Competition Timeline

Start Date: 8/30/2016 1:08 AM UTC

Merger Deadline: None

Entry Deadline: None

End Date: None

  • Due to the public nature of the data, this competition does not count towards Kaggle ranking points.
  • We ask that you respect the spirit of the competition and do not cheat. Hand-labeling is forbidden.

 

Evaluation

Goal

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
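The metric can be computed directly with NumPy. The prices below are made up for illustration; note that equal *relative* errors on a cheap and an expensive house contribute equally:

```python
# Sketch of the evaluation metric: RMSE between log(predicted) and
# log(observed) sale prices. Example values are illustrative only.
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE computed on the natural logarithms of the two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

score = rmse_of_logs([200_000, 100_000], [220_000, 90_000])
print(round(score, 4))
```

Lower is better; a perfect submission scores 0.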

Submission File Format

The file should contain a header and have the following format:

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.

You can download an example submission file (sample_submission.csv) on the Data page.
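A file in that format is straightforward to produce with pandas. The Ids below match the example above; the SalePrice values are placeholders where your model's predictions would go:

```python
# Sketch of writing a correctly formatted submission file with pandas.
# The predictions are placeholders, not real model output.
import pandas as pd

test_ids = [1461, 1462, 1463]                    # Ids from the test set
predictions = [169000.1, 187724.1233, 175221.0]  # placeholder SalePrice values

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})
submission.to_csv("submission.csv", index=False)  # header row, no index column
```

`index=False` matters: an extra index column would not match the expected header.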

 

Tutorials

Kaggle Learn

Kaggle Learn offers hands-on courses for most data science topics. These short courses prepare you with the key ideas to build your own projects.

The Machine Learning Course will give you everything you need to succeed in this competition and others like it.

Other R Tutorials

Fun with Real Estate Data

  • Use R Markdown to learn advanced regression techniques like random forests and XGBoost

XGBoost with Parameter Tuning

  • Implement LASSO regression to avoid multicollinearity
  • Includes linear regression, random forest, and XGBoost models as well

Ensemble Modeling: Stack Model Example

  • Use "ensembling" to combine the predictions of several models
  • Includes GBM (gradient boosting machine), XGBoost, ranger, and neural net using the caret package

A Clear Example of Overfitting

  • Learn about the dreaded consequences of overfitting data

Other Python Tutorials

Comprehensive Data Exploration with Python

  • Understand how variables are distributed and how they interact
  • Apply different transformations before training machine learning models

House Prices EDA

  • Learn to use visualization techniques to study missing data and distributions
  • Includes correlation heatmaps, pairplots, and t-SNE to help inform appropriate inputs to a linear model

A Study on Regression Applied to the Ames Dataset

  • Demonstrate effective tactics for feature engineering
  • Explore linear regression with different regularization methods including ridge, LASSO, and ElasticNet using scikit-learn
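The three regularization methods that tutorial covers differ in how they shrink coefficients; a minimal scikit-learn sketch on synthetic data (with a deliberately sparse true coefficient vector) shows the characteristic behavior, such as LASSO zeroing out irrelevant features:

```python
# Sketch of ridge, LASSO, and ElasticNet regularization via scikit-learn,
# fit on synthetic data with mostly-zero true coefficients.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, 0, 0, 2.0, 0, 0, 0, 1.0, 0, 0])  # sparse truth
y = X @ true_coef + rng.normal(scale=0.5, size=200)

zeroed = {}
for Est in (Ridge, Lasso, ElasticNet):
    est = Est(alpha=0.1).fit(X, y)
    # Count coefficients the penalty drove (numerically) to zero.
    zeroed[Est.__name__] = int(np.sum(np.isclose(est.coef_, 0.0)))
print(zeroed)
```

Ridge shrinks coefficients but rarely zeros them; LASSO's L1 penalty performs feature selection, and ElasticNet blends the two.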

Regularized Linear Models

  • Build a basic linear model
  • Try more advanced algorithms including XGBoost and neural nets using Keras