Given the increasingly volatile state of the stock market due to the coronavirus and the reactions of investors and the media, traditional automated trading strategies that look exclusively at technical indicators may fail. I wanted to see if I could create an ML model that accurately predicts how the market will move on any given day under these conditions.
What it does
StockPredictor includes two models:
1] Short-term MLR Model -- a Multiple Linear Regression model that can predict whether a given stock will increase or decrease over the course of a day and by how much. This model is trained on data from the past 60 days and is lightweight -- it can learn any stock's behavior in under 20 seconds. Remarkably, the model's R^2 value ranged from 0.80 to 0.90 for a number of common stocks, meaning it explained a large amount of the variance in the change in stock price over time and fit the observed data well. The features this model is trained on are daily news sentiment, technical indicators (e.g. simple moving average, volume), and coronavirus statistics.
2] Long-term AutoML Model -- a model trained on 20 years of historical data that predicts which way the Dow will move in the next 24 hours. This model was created using Google's Cloud AutoML Tables API. It has ~70% accuracy, meaning it predicts the direction of the Dow's next move correctly almost 70 percent of the time. It also has an AUROC of over 0.75, indicating strong discriminatory power. It's trained exclusively on technical indicators.
Together, these models can help investors decide when to buy, sell, or hold stock. StockPredictor can currently be deployed using a Flask server. Users can make API calls for predictions or use the provided web interface.
How I built it
First, I created a VM Instance on Google Cloud and used DataLab to experiment with different APIs and create an initial version of my models. I collected data for my models from a few different sources:
- I found aggregate coronavirus statistics on Kaggle and uploaded them to a Bucket for use in DataLab, learning about I/O in DataLab along the way. The coronavirus statistics were preprocessed to reflect percent change in deaths, confirmed cases, and recovered patients rather than cumulative totals. The thinking behind this was that large jumps in coronavirus deaths or cases would negatively influence investor perception and subsequently impact the market.
- I used the Google News API to query for daily headlines related to given stocks. I then used the Google Cloud Natural Language API to analyze the overall sentiment of daily headlines, generating a dataset of news sentiment scores for each day.
- I gathered all historical stock data, including technical indicators and daily stock quotes, using the free Alphavantage API. The technical indicators I chose to use as features were RSI, MACD, SMA, the Middle / Upper / Lower Bollinger Bands, and Volume. My target variable was the daily percent change in the stock's closing price. Rather than attempting to forecast stock price, my algorithms simply make a judgment _on whether a stock will move up or down and to what extent_.
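The data preparation described above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the column names (`date`, `deaths`, `confirmed`, `recovered`, `score`) and the `prepare_features` helper are assumptions for the example.

```python
import pandas as pd

def prepare_features(covid: pd.DataFrame, sentiment: pd.DataFrame) -> pd.DataFrame:
    """Convert cumulative coronavirus totals to day-over-day percent change
    and join them with the mean daily headline sentiment score."""
    covid = covid.sort_values("date").set_index("date")
    # pct_change turns cumulative totals into daily percent change
    changes = covid[["deaths", "confirmed", "recovered"]].pct_change()
    changes = changes.add_prefix("pct_change_")

    # average the per-headline sentiment scores into one score per day
    daily_sentiment = sentiment.groupby("date")["score"].mean().rename("sentiment")

    # merge on date and drop rows with missing data, as described above
    return changes.join(daily_sentiment).dropna()
```

The first row always drops out because percent change is undefined without a prior day's total.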
After cleaning up the data by merging dataframes, normalizing columns to have values between -1 and 1, and removing rows with missing data, I used statsmodels to fit a multiple linear regression model to change in stock price based on the aforementioned features (coronavirus statistics, technical indicators, and news sentiment scores). Using statsmodels I also determined which features were correlated with each other, which were statistically significant (p < .1), and which had little impact on the model's performance. By doing this I was able to remove unnecessary features and construct a better-generalized model. Specifically, I aimed to maximize the model's F-statistic -- a higher F-statistic indicates the model performs better than a model with no independent variables.
I also wanted to look at how effective long-term historical data was at predicting the general movement of the stock market within the last few months. I used Alphavantage to gather technical indicators and stock quotes, specifically for the Dow, over the past 20 years. After cleaning and normalizing this data I used the Google Cloud AutoML Tables API to create a binary classification model that predicts whether the Dow will move up or down on any given day from the aforementioned dataset. The model took approximately an hour to train but yielded good results! Once the model was trained I deployed it for online use and made requests to it from the Python client libraries.
Finally, after I felt comfortable with all my work on Datalab, I exported my notebook to a Bucket, downloaded it locally, and started creating a Flask server to host the ML models. The Flask server includes an API that users can make POST requests to and a web interface, both returning model predictions. To style my application, I used Materialize.
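A prediction endpoint on such a Flask server might look like the following minimal sketch; the `/predict` route, the JSON payload shape, and the stand-in `predict_change` helper are assumptions for illustration, not the project's actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_change(features):
    # Stand-in for the trained model; the real server would run the
    # normalized feature vector through the fitted MLR model instead.
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [0.1, -0.3, ...]}
    payload = request.get_json(force=True)
    features = payload["features"]
    return jsonify({"predicted_pct_change": predict_change(features)})
```

A client would then POST its feature vector and read the predicted percent change out of the JSON response.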
The features with the smallest p-values in the Multiple Linear Regression model were sentiment score, MACD, and RSI, all with p < 0.05. Additionally, percent change in coronavirus confirmed cases had a low p-value, indicating _strong correlation_ with the target variable. This was an interesting finding, as changes in coronavirus deaths or recoveries didn't seem to have any statistically significant impact on predictions.
Another interesting finding was that the R^2 values were high (0.80-0.90) for the MLR model even with a relatively small sample size of 60 observations. This could be either because a) the large number of features increased the dimensionality of the model, making it easier to fit, or b) it reflects a real trend in the market.
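One way to probe explanation a) is the adjusted R^2, which discounts fit gained merely by adding predictors: R̄^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The feature count below is illustrative, since the exact number after pruning isn't stated above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors: it only rises
    when a feature improves the fit more than chance alone would."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With 60 observations and, say, 10 features (an assumed count),
# an R^2 of 0.85 still leaves a healthy adjusted value:
print(round(adjusted_r2(0.85, 60, 10), 3))  # → 0.819
```

If the adjusted value stayed high, that would favor explanation b) over pure dimensionality inflation.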
Challenges I ran into
I ran into a few challenges --
- Using the Google Cloud Client Libraries: I'd never used client libraries for the new AutoML Tables API and kept getting a ContextualVersionConflict error. I eventually resolved this by uninstalling and reinstalling the google-api-core libraries.
- Service Account Keys: I accidentally pushed my service account key to GitHub. This made Google Cloud temporarily suspend my account because it violated their terms of service. Luckily I fixed the issue quickly, filed an appeal, and had my account reinstated in less than an hour.
- Downloading files from DataLab: I had a tough time moving files between my local machine and DataLab, but after reading about Buckets more I was able to resolve the problem.
What I learned
I learned a lot about Google Cloud, namely how to use different client libraries, Buckets, and AutoML. I also learned a little bit of Flask and how to integrate it with a machine learning backend!
For the future, I'd want to expand the functionality of my API and make it more usable for intraday or swing traders. Further, I'd want to experiment with more features to see if I could improve the performance of my MLR model and build my own architecture to try and outperform AutoML. Overall, I'm proud of finishing this project and getting both models to have great performance!
Try it out
css, flask, google-automl, google-cloud, google-compute-engine, google-nlp, html, jupyter-notebook, python, scikit-learn