MLVehiclePriceAdvisor

Intro

MLVehiclePriceAdvisor is a Python-based project designed to scrape vehicle data from the popular automotive marketplace Otomoto. The data is then processed and used to train a random forest machine learning model to predict optimal vehicle pricing based on their features.

The script extracts vehicle data from Otomoto and saves it into CSV files for each car brand.
The project processes and cleans the data, preparing it for machine learning analysis.
The random forest model predicts optimal vehicle prices based on attributes like mileage, production year, and engine power.
A console application allows users to input vehicle details and receive price recommendations.

Features

Vehicle data extraction
Data preprocessing
Random forest model
Optimal price prediction
Console application
Python-based implementation

CSV data storage
Data cleaning
Responsive predictions
User-friendly interface
Machine learning analysis

Technologies Used

Python
Selenium
Pandas

Scikit-learn
Joblib

Demo

Data Scraping

Scraping listings for a specific brand.

Checking if data has already been collected for a specific fuel type.

Skipping brands with existing CSV files to avoid duplicates.

Data Processing

Original dataset collected directly from Otomoto, containing raw and unprocessed data.

Data Merging: Combines multiple CSV files from different car brands into one dataset.
Invalid Data Removal: Removes invalid entries and non-numeric units (e.g., 'km', 'cm³', 'KM').
Price Conversion: Converts prices to numerical format by removing commas and changing to floats.
Consistency Check: Ensures all entries are formatted consistently.

Data Conversion: Converts mileage, engine power, and prices to float format for consistency.
Space Removal: Removes spaces and ensures all numeric fields are formatted correctly.
Dataset Balancing: Adjusts dataset entries to ensure a balanced representation.
Data Validation: Checks for missing or corrupted data entries.

Processed dataset, cleaned and formatted for machine learning analysis.

Model Training

The RandomForest_Training.py script trains a Random Forest model to predict vehicle prices.

Data Preparation: Loads data and converts price to numeric format.
Feature Encoding: Categorical features such as 'brand_name', 'car_model', 'gearbox', and 'fuel' are converted to numerical format using one-hot encoding.
Feature Selection: Input features X exclude the target price column, which is set as the output y.
Data Splitting: Data is split into training (80%) and testing (20%) sets to validate model performance.
Batch Training: Uses a batch size of 10,000 for efficient training on large datasets.
Model Evaluation: The trained model returned an R² score of 0.81.

Price Prediction - Code Analysis

Data Loading: The model and data are loaded from pre-saved files.
Data Preprocessing: Converts prices to float and applies one-hot encoding for categorical features.
Feature Engineering: Adjusts the predicted price based on mileage and production year.
User Input Handling: The console application prompts the user for details such as brand, model, mileage, and production year.
Data Validation: Ensures the user input data is complete and valid before proceeding with predictions.
Price Prediction: The model predicts an initial price based on the provided vehicle details.

Prediction Adjustment: The predicted price is adjusted iteratively based on the input mileage and production year.
Comparison with Market Data: Compares the predicted price with similar listings on Otomoto to ensure accuracy.
Final Output: The adjusted price is displayed to the user, providing an optimal pricing suggestion.
User Interaction: The application provides feedback based on the availability of data for the specified make and model.
Error Handling: Handles cases where the model cannot make a prediction due to insufficient data.
Data Logging: Logs the user input and predicted prices for future analysis and model improvement.

Price Prediction

Scenario: Input data matches the average range for existing data.
Optimal Price: The model predicts an optimal price based on typical data points.
User Input: Brand: Ferrari, Model: F430 F1, Mileage: 50000 km, Year: 2007.
Result: The predicted price aligns with the average market price, suggesting a fair valuation.

Scenario: Input data exceeds typical range, representing a newer model year.
Price Adjustment: The model adjusts the predicted price to reflect higher value for newer models.
User Input: Brand: Ferrari, Model: F430 F1, Mileage: 50000 km, Year: 2020.
Result: The predicted price is higher due to the newer model year, providing a realistic market estimation.

Source Code

You can view the source code: HERE

Back to Portfolio