Model.py
Introduction
The model.py file in basic-coin-prediction-node consists of several key components:
- Imports and Configuration: Sets up necessary libraries and configuration variables.
- Paths Configuration: Generates paths for storing data dynamically based on coin symbols.
- Downloading Data: Downloads historical price data for the specified symbols, intervals, years, and months.
- Formatting Data: Reads, formats, and saves the downloaded data as CSV files.
- Training the Model: Trains a linear regression model on the formatted price data and saves the trained model.
While the import and path configuration processes are straightforward, downloading and formatting the data, as well as training the model, require specific steps.
This documentation will guide you through creating models for different coins, making it easy to extend the script for general-purpose use.
Downloading the Data
The download_data function automates the process of downloading historical market data from Binance, a popular cryptocurrency exchange. It fetches data for a specified set of symbols (in this case, the trading pair "ETHUSDT") across various time intervals and stores it in a defined directory.
How to Use for Downloading Data of Any Coin
- Update the Symbols List: Replace ["ETHUSDT"] with the desired trading pair(s), e.g., ["BTCUSDT", "LTCUSDT"].
- Adjust Time Intervals: Modify the intervals list if you need different time intervals. Binance supports various intervals such as ["1m", "5m", "1h", "1d", "1w", "1M"].
- Extend Date Ranges: Update the years and months lists to match the historical range you need.
- Define the Download Path: Ensure binance_data_path is set to the directory where you want the data to be saved.
Here’s a quick example of how to adjust the script for downloading data for multiple trading pairs:
```python
def download_data():
    cm_or_um = "um"
    symbols = ["BTCUSDT", "LTCUSDT"]  # Updated symbols
    intervals = ["1d"]
    years = ["2020", "2021", "2022", "2023", "2024"]
    months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    download_path = binance_data_path
    download_binance_monthly_data(
        cm_or_um, symbols, intervals, years, months, download_path
    )
    print(f"Downloaded monthly data to {download_path}.")
    current_datetime = datetime.now()
    current_year = current_datetime.year
    current_month = current_datetime.month
    download_binance_daily_data(
        cm_or_um, symbols, intervals, current_year, current_month, download_path
    )
    print(f"Downloaded daily data to {download_path}.")
```
Formatting the Data
The format_data function processes the raw data files downloaded from Binance, transforming them into a consistent format for analysis. Here are the key steps:
- File Handling: Lists and sorts all files in the binance_data_path directory. Exits if no files are found.
- Initialize DataFrame: An empty DataFrame price_df is created to store the combined data.
- Process Each File: Filters for .zip files and reads the contained CSV file. Retains the first 11 columns and renames them to ["start_time", "open", "high", "low", "close", "volume", "end_time", "volume_usd", "n_trades", "taker_volume", "taker_volume_usd"]. Sets the DataFrame index to the end_time column, converted to a timestamp.
- Concatenate Data: Combines data from each file into the price_df DataFrame.
- Sort and Save: Sorts the final DataFrame by date and saves it to training_price_data_path.
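The steps above can be sketched as the following standalone function. This is an illustrative outline, not the exact implementation in the repository: for self-containment it takes the data and output paths as parameters (the real script reads them from its configuration globals binance_data_path and training_price_data_path), it assumes headerless Binance kline CSVs inside each .zip, and it names the index column "date" so the training step can read it back under that name.

```python
import os
import zipfile

import pandas as pd


def format_data(binance_data_path, training_price_data_path):
    # List and sort everything that was downloaded; bail out if nothing is there
    files = sorted(os.listdir(binance_data_path))
    if len(files) == 0:
        return

    price_df = pd.DataFrame()
    for file in files:
        # Only the .zip archives downloaded from Binance are processed
        if not file.endswith(".zip"):
            continue
        with zipfile.ZipFile(os.path.join(binance_data_path, file)) as zf:
            with zf.open(zf.namelist()[0]) as f:
                df = pd.read_csv(f, header=None).iloc[:, :11]

        # Rename the first 11 kline columns to readable names
        df.columns = [
            "start_time", "open", "high", "low", "close", "volume",
            "end_time", "volume_usd", "n_trades", "taker_volume",
            "taker_volume_usd",
        ]

        # Index by the period end time, converted to a timestamp
        df.index = pd.to_datetime(df["end_time"], unit="ms")
        df.index.name = "date"
        price_df = pd.concat([price_df, df])

    # Sort chronologically and persist the combined data for training
    price_df.sort_index().to_csv(training_price_data_path)
```

Because each monthly archive holds one CSV, opening the first member of every zip is enough; concatenating per file and sorting once at the end keeps the function simple even when archives arrive out of order.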
Column Descriptions
- start_time: The start of the trading period.
- open: Opening price.
- high: Highest price during the period.
- low: Lowest price during the period.
- close: Closing price.
- volume: Trading volume.
- end_time: End of the trading period.
- volume_usd: Trading volume in USD.
- n_trades: Number of trades.
- taker_volume: Taker buy volume.
- taker_volume_usd: Taker buy volume in USD.
This function consolidates and formats the historical price data, making it ready for analysis or machine learning tasks.
Training the Model
The train_model function trains a linear regression model using historical price data and saves the trained model to a file. Here's a breakdown of the process:
- Load the Data: Reads the price data from a CSV file specified by training_price_data_path.
- Prepare the DataFrame: Converts the date column to a timestamp and stores it as a numerical value. Computes the average price using the open, close, high, and low columns.
- Reshape Data for Regression: Extracts the date column as the feature (x) and the computed average price as the target (y). Reshapes these arrays to the format expected by scikit-learn.
- Split the Data: Splits the data into a training set and a test set using an 80/20 split. However, the test set is not used further in this function.
- Train the Model: Initializes and trains a LinearRegression model using the training data.
- Save the Model: Creates the directory for the model file if it doesn't exist, then saves the trained model to a file specified by model_file_path using pickle.
- Print Confirmation: Prints a message indicating that the trained model has been saved.
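Putting these steps together, the baseline linear-regression flow looks roughly like the sketch below. It is illustrative rather than the exact repository code: the paths are taken as parameters here for self-containment (the real function reads the training_price_data_path and model_file_path configuration globals), and the "date" and OHLC column names are assumed to match the formatted CSV described above.

```python
import os
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def train_model(training_price_data_path, model_file_path):
    # Load the formatted price data
    price_data = pd.read_csv(training_price_data_path)
    df = pd.DataFrame()

    # Convert the date column to a numeric Unix timestamp for regression
    df["date"] = pd.to_datetime(price_data["date"]).map(pd.Timestamp.timestamp)

    # The target is the mean of the open, close, high, and low prices
    df["price"] = price_data[["open", "close", "high", "low"]].mean(axis=1)

    # scikit-learn expects 2-D feature and target arrays
    x = df["date"].values.reshape(-1, 1)
    y = df["price"].values.reshape(-1, 1)

    # 80/20 split; the test set is not used further in this function
    x_train, _, y_train, _ = train_test_split(x, y, test_size=0.2, random_state=0)

    model = LinearRegression()
    model.fit(x_train, y_train)

    # Persist the trained model with pickle
    os.makedirs(os.path.dirname(model_file_path) or ".", exist_ok=True)
    with open(model_file_path, "wb") as f:
        pickle.dump(model, f)
    print(f"Trained model saved to {model_file_path}")
```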
Modifying the Function for Different Models
Understanding the Target Variable in Regression Models
The target variable (y) in regression models is a critical component that determines the type of analysis and predictions that can be performed. In this context, the target variable represents continuous data—in this case, the average of financial metrics such as the open, close, high, and low prices over time. The y-axis points on the regression graph correspond to these continuous values, which can take on any numerical value within a defined range.
To change the model used for training, replace the LinearRegression model with another machine learning algorithm. Here is an example:
Using Polynomial Regression
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import os
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split


def train_model():
    # Load the ETH price data
    price_data = pd.read_csv(training_price_data_path)
    df = pd.DataFrame()

    # Convert 'date' to a numerical value (timestamp) we can use for regression
    df["date"] = pd.to_datetime(price_data["date"])
    df["date"] = df["date"].map(pd.Timestamp.timestamp)

    # Calculate the mean price as the target variable
    df["price"] = price_data[["open", "close", "high", "low"]].mean(axis=1)

    # Reshape the data to the shape expected by sklearn
    x = df["date"].values.reshape(-1, 1)
    y = df["price"].values.reshape(-1, 1)

    # Split the data into training set and test set
    x_train, _, y_train, _ = train_test_split(x, y, test_size=0.2, random_state=0)

    # Create a pipeline that first transforms the input data to polynomial
    # features and then fits a linear model
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(x_train, y_train)

    # Create the model's parent directory if it doesn't exist
    os.makedirs(os.path.dirname(model_file_path), exist_ok=True)

    # Save the trained model to a file
    with open(model_file_path, "wb") as f:
        pickle.dump(model, f)

    print(f"Trained polynomial regression model saved to {model_file_path}")
```
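Whichever model you train, the pickled file can later be loaded for inference. Below is a minimal, self-contained sketch of that round trip; the model_file_path name and the trivial two-point model are hypothetical stand-ins for illustration (in the node, the file is produced by train_model, and the feature would be a Unix timestamp).

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

model_file_path = "model.pkl"  # hypothetical path for illustration

# For a self-contained example, fit and save a trivial model first;
# in the node this file is produced by train_model()
demo = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([[1.0], [3.0]]))
with open(model_file_path, "wb") as f:
    pickle.dump(demo, f)

# Load the pickled model back and predict from a 2-D feature array
with open(model_file_path, "rb") as f:
    model = pickle.load(f)

prediction = model.predict([[2.0]])
print(f"Predicted value: {prediction[0][0]:.2f}")
```

Because the model is saved with pickle, the same scikit-learn version (or a compatible one) should be installed wherever the file is loaded.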