Stories by Ryo Koyajima / 小矢島諒 on Medium

Stone Soup and Data Science

Ryo Koyajima / 小矢島諒 — Sun, 20 Nov 2022 14:15:29 GMT

I've worked as a Data Scientist for several years, often alongside non-technical people. Those experiences remind me that "Stone Soup," a folk story, offers a profound insight into working as a Data Scientist. Let me share the story and my thoughts.

For the sake of a delicious meal

For those who don't know the story, here is the version from Wikipedia.

Some travelers come to a village, carrying nothing more than an empty cooking pot. Upon their arrival, the villagers are unwilling to share any of their food stores with the very hungry travelers. Then the travelers go to a stream and fill the pot with water, drop a large stone in it, and place it over a fire. One of the villagers becomes curious and asks what they are doing. The travelers answer that they are making “stone soup”, which tastes wonderful and which they would be delighted to share with the villager, although it still needs a little bit of garnish, which they are missing, to improve the flavor.

The villager, who anticipates enjoying a share of the soup, does not mind parting with a few carrots, so these are added to the soup. Another villager walks by, inquiring about the pot, and the travelers again mention their stone soup which has not yet reached its full potential. More and more villagers walk by, each adding another ingredient, like potatoes, onions, cabbages, peas, celery, tomatoes, sweetcorn, meat (like chicken, pork and beef), milk, butter, salt and pepper. Finally, the stone (being inedible) is removed from the pot, and a delicious and nourishing pot of soup is enjoyed by travelers and villagers alike. Although the travelers have thus tricked the villagers into sharing their food with them, they have successfully transformed it into a tasty meal which they share with the donors.

This story has several variations, yet they share some insights.

Villagers have enough resources to make delicious food, but none of them is aware of the possibility.
Travelers have the ability to make a soup, but that ability won't help without enough resources.
Delicious soup can be made only if villagers and travelers work together.

Build a fellowship

During our data science project, we often face various hurdles to achieving our goal. It is sometimes a technical matter and sometimes a business matter. Especially for a business, we may have a chance to work with non-technical people and possibly need to convince them. It may become laborious and time-consuming if they are unwilling to work with you proactively.

At such times, we can remember what the travelers did to make a great soup. They showed the villagers what was possible by demonstrating with a stone and water. They convinced the villagers one by one and led them toward the big picture.

In the same way, we data scientists can build a fellowship with non-technical people to accomplish our goals together. We ask about their most painful issue to clarify our goal. We create a prototype for them to imagine how beneficial it is. Then we can convince them to extract data or provide domain-specific information like the soup's ingredients to make the product more meaningful. The more elements we can add to our product, the more people we can cooperate with in the same direction, and the faster we can move the project forward.

Conclusion

It has long been said that machine learning needs to be put to practical use in society. Whether that happens still depends on how well we can translate AI technology into business value for non-technical people and how compelling a big picture we can draw for them.

As an aside, I came across this folk tale in the book "The Pragmatic Programmer." The book is written for software engineers, but it's full of insights, so I recommend it to data scientists as well.

Stable Diffusion Quickstart with WSL2 and RTX3070

Ryo Koyajima / 小矢島諒 — Mon, 19 Sep 2022 17:00:25 GMT

~ Generate Boss Baby-ish Profile Image ~

Objective

To generate my profile image on Twitter.

Unlike LinkedIn or Facebook, Twitter is a somewhat anonymous service, so I don’t want to use my photo as a profile image. Therefore, I’ve been looking for a nice picture to use as my SNS icon.

Pre-requisites

Windows 10 Home 21H2
WSL2 Ubuntu 20.04.5 LTS

# run on wsl to show the version
lsb_release -a

Kernel: Linux version 5.10.102.1-microsoft-standard-WSL2

# run on cmd to show the version
wsl cat /proc/version

CPU: Ryzen 5 5600X
GPU: GeForce RTX 3070
RAM: 32GB
VRAM: 8GB
Package management: Anaconda
Not using Docker
Use optimized stable diffusion due to VRAM limitation

Quick Guide

1. Install WSL2 and update to the latest version

https://learn.microsoft.com/ja-jp/windows/wsl/install

# run on cmd
wsl --install
wsl --update
wsl --install -d Ubuntu-20.04

2. Install CUDA Toolkit

https://learn.microsoft.com/ja-jp/windows/ai/directml/gpu-cuda-in-wsl

https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl

# run on wsl
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo apt-get update
sudo apt-get -y install cuda
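
Before moving on, it can be worth checking that the GPU is actually visible from WSL. The snippet below is only a minimal sketch and was not part of the steps I ran originally. It assumes that the nvidia-smi command is available on the PATH inside WSL, and gpu_visible is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi can be found and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # nvidia-smi is not on the PATH (assumption: it normally is in WSL2)
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU visible from WSL:", gpu_visible())
```

If this reports False, the driver on the Windows side is the first thing I would check.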

3. Clone Stable Diffusion (Optimized one)

https://github.com/basujindal/stable-diffusion

# run on wsl
git clone git@github.com:basujindal/stable-diffusion.git
cd stable-diffusion

4. Download model “sd-v1-4.ckpt”

https://huggingface.co/CompVis/stable-diffusion-v-1-4-original

5. Rename and move the model

# run on wsl
mkdir -p models/ldm/stable-diffusion-v1
mv sd-v1-4.ckpt models/ldm/stable-diffusion-v1/model.ckpt

6. Install Anaconda (Miniconda)

# run on wsl
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh
bash Miniconda3-py38_4.12.0-Linux-x86_64.sh

7. Install Python Packages

# run on wsl
conda env create -f environment.yaml
conda activate ldm

8. Prepare image

# run on wsl
mkdir img
mv [path to your image file] img/001.jpg
# the image file name and path are up to you; any name works.

9. Run Script

# run on wsl
python optimizedSD/optimized_img2img.py --prompt "boss baby" --init-img img/001.jpg --strength 0.8 --n_iter 10 --n_samples 10 --H 512 --W 512

Parameters

prompt: the text prompt to combine with the image
init-img: the input image to combine with the prompt
n_samples: number of images generated per iteration (n_iter sets the number of iterations)

Lastly, check the stable-diffusion/outputs/img2img-samples/boss_baby directory.

My new SNS icon

I hope this helps!

*2022/11/6: modified some commands according to a comment.

How to Deploy Your Jupyter Notebook As a Dashboard: A use case of visualizing stock data with AWS

Ryo Koyajima / 小矢島諒 — Sun, 28 Aug 2022 12:39:16 GMT

~ Build an automated dashboarding system with AWS SageMaker, GitHub Actions, ECR, App Runner, and Mercury ~

Transform Jupyter Notebook into Dashboard

Introduction

Jupyter Notebook is one of the vital tools for data scientists. However, there are still difficulties in collaborating with your teammates and sharing your notebooks with your stakeholders.

I'll introduce an automated dashboarding system using AWS SageMaker and Mercury.

SageMaker enables us to edit notebooks in a hosted environment in AWS. It's easy to share your notebooks with your teammates, and you don't have to worry about building and maintaining a Jupyter server.

Mercury is a Python library that can transform Jupyter Notebooks into dashboards. You can easily publish your dashboard to your counterparts by using this library.

Architecture

Automated Dashboarding System Architecture

For quick understanding, here is the workflow of the whole architecture.

Developers edit Jupyter Notebooks in Amazon SageMaker.
Developers push their commits to GitHub.
GitHub detects the push and automatically runs GitHub Actions, a CI/CD service.
During GitHub Actions, a Docker image is built and pushed to Amazon ECR, a container registry in AWS.
As soon as the Docker image is pushed to ECR, it is deployed to Amazon App Runner.
App Runner hosts the Mercury server so that end users can access the dashboard.

Here are the services and software used in this article.

Mercury
Amazon App Runner
Amazon SageMaker
GitHub Actions
Amazon ECR (Elastic Container Registry)
Amazon S3
Docker

A use case for visualizing stock data

Here is a use case of this dashboarding system which visualizes the S&P 500 ETF data.

Source code: https://github.com/koyaaarr/invest-analytics-aws

Store the stock data in S3

First, you need to store the data you want to visualize. There are several options, and I chose S3 in this use case.

stored stock data in S3

Edit a Jupyter Notebook in SageMaker

It's straightforward to use Jupyter Notebook in SageMaker, even if you're new to this like me.

You can launch the SageMaker Studio service and create a new notebook just as you would on your local machine.

Create a new notebook in SageMaker

Visualize the moving average and MACD of the S&P500 ETF

This is the analysis part, where I visualize some financial indicators. The analysis is the most critical part of an actual use case, but I'll skip a detailed explanation since it is out of the scope of this article.

Visualize the MACD of the ETF
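
For readers who have never computed MACD, here is a rough sketch of the idea in plain Python. It is not the code from my notebook: the 12/26/9 spans are the commonly used defaults and are an assumption here, ema and macd are hypothetical helper names, and the values will not match a library such as TA-Lib exactly because the first EMA value is seeded differently.

```python
def ema(values: list[float], span: int) -> list[float]:
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    result = [values[0]]  # seed with the first value (a simplification)
    for value in values[1:]:
        result.append(alpha * value + (1 - alpha) * result[-1])
    return result


def macd(closes: list[float], fast: int = 12, slow: int = 26, signal: int = 9):
    """Return (macd_line, signal_line, histogram) for a list of closing prices."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Illustrative prices only, not real market data.
closes = [100.0, 101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1]
macd_line, signal_line, histogram = macd(closes)
print(round(macd_line[-1], 4), round(histogram[-1], 4))
```

A positive MACD line means the short-term average is above the long-term one, which is what the chart makes visible at a glance.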

Add widgets for Mercury

To add widgets to the dashboard, you must add a special cell at the top of the notebook. See the official documentation for details.

https://github.com/mljar/mercury

Config of the widgets

I lay out some slider widgets in this dashboard, and the configuration will be generated like this.

Actual widgets

Commit your work

You can easily git commit your work in this tab.

Push your commit to GitHub

Once you finish your work, you can push it to GitHub. You must generate a personal access token in GitHub to git push the commits to the remote repository.

https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

Create Dockerfile and requirements.txt

In lines 5 to 12, I install the "TA-LIB" library for stock analysis, so you can skip this part if you don't need it.

Create ECR private repository

Before creating GitHub Actions, you need to create an ECR repository. The name of your repository will be used in GitHub Actions.

Create YAML for GitHub Actions

This configuration is set to execute when the master branch is changed. See the official documentation for details. Don't forget to set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ECR_REPO_NAME as environment variables in your repository. You might need to create an IAM user with appropriate privileges to generate an access key.

https://github.com/aws-actions/amazon-ecr-login

Create an App Runner service

Lastly, let's create an App Runner instance to host your application. The deployment will be started once you set up the service's configuration. It takes several minutes.

Access Dashboard

After all of the work is finished, you can access your dashboard by clicking the link described in your App Runner service.

You can set VPC and security groups to set access control.

Conclusion

I introduced an automated dashboarding system using SageMaker and Mercury. In addition, GitHub Actions can realize CI/CD, so your code change will automatically be reflected on the dashboard. I hope this article helps.

Python Streamlit in Practice; A Use-Case of Visualizing Stock Data

Ryo Koyajima / 小矢島諒 — Mon, 27 Jun 2022 17:11:36 GMT

~How to create an ETF portfolio simulator in Python Streamlit~

https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/

Streamlit is becoming one of the best options for creating demos using only Python. I will explain how to create an effective dashboard using Streamlit and deploy it to Streamlit Cloud with actual stock data.

Quick Demo

Here is my demo of the ETF simulator deployed in Streamlit Cloud.

https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/

The source code is here.

https://github.com/koyaaarr/invest-analytics-ui

Let me explain how to build the dashboard one by one.

Development Environment

Before that, I want to mention the development environment. Preparing an organized environment is essential to develop faster and more steadily.

I recommend using Visual Studio Code for coding and Poetry for managing Python and its libraries. I wrote an article about building the environment, so please take a look if you want.

https://medium.com/codex/python-development-setup-for-data-scientists-2022-7f80b2018402

Data Processing

First, we must process raw stock data into the appropriate format to visualize them.

For instance, you need to calculate portfolio value by multiplying each ETF’s value and the quantity you have.
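
As a minimal sketch of that calculation (plain Python rather than the pandas code in my notebook), the portfolio value on one date is the sum of price times quantity over all holdings. The function name portfolio_value and the numbers are hypothetical and only for illustration.

```python
def portfolio_value(closes: dict[str, float], num_holds: dict[str, float]) -> float:
    """Sum of (closing price x quantity held) over the tickers in num_holds."""
    return sum(closes[ticker] * quantity for ticker, quantity in num_holds.items())


# Illustrative numbers only, not real market data.
closes_on_date = {"VOO": 350.0, "BND": 75.0, "BTC-USD": 20000.0}
num_holds = {"VOO": 10, "BND": 20, "BTC-USD": 0.5}

print(portfolio_value(closes_on_date, num_holds))  # 3500 + 1500 + 10000 = 15000.0
```

Repeating this for every date gives the time series plotted later as Close_Portfolio.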

I will visualize the following information in the dashboard, so the data must be in the right format for each graph.

Time-series change of overall portfolio value
Time-series change of the Sharpe ratio
Each stock’s ratio in my portfolio

I won’t explain the data processing part in detail. To see it, check the Jupyter Notebook in my source code.

https://github.com/koyaaarr/invest-analytics-ui/blob/master/notebooks/quick_look.ipynb

Visualization

Once you prepare the data, let’s visualize each of them. I use Plotly, a good-looking graph library for Python.

Time-series change of overall portfolio value

It’s straightforward to plot a line chart in Plotly. You can do that with two lines of code (excluding the import statement).

# you can run this code in Jupyter Notebook
# stocks: dataframe contains daily stock value
# Date: date like '2022-06-01'
# Close_Portfolio: calculated value of portfolio on the date

import plotly.express as px
fig = px.line(stocks, x="Date", y="Close_Portfolio")
fig.show()

plot portfolio

There is a blank space in 2018 because the values in this period are null. Thus, I drop these rows to plot only valid values. In addition, I don’t need grid lines in the graph, so I omit them.

fig = px.line(stocks.dropna(subset=['Close_Portfolio']), x="Date", y="Close_Portfolio")
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

plot portfolio (improved)

This graph is much better than the former one.

Time-series change of the Sharpe ratio

You can plot the Sharpe ratio the same way as the portfolio. It is said that your portfolio is good if the Sharpe ratio is greater than 1. Therefore, add a baseline to the chart.

# you can run this code in Jupyter Notebook
# sharpe: dataframe contains daily sharpe ratio value
# Date: date like '2022-06-01'
# sharpe_ratio_annual: calculated value of sharpe ratio on the date

fig = px.line(sharpe.dropna(subset=['sharpe_ratio_annual']), x="Date", y="sharpe_ratio_annual")
fig.add_hline(1, line_color="red")
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
fig.show()

plot Sharpe ratio

Looks good!
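
The Sharpe ratio plotted above has to be computed from the portfolio values first. The following is a simplified sketch, not the code from my notebook: it assumes a risk-free rate of 0 and 252 trading days per year, and annualized_sharpe is a hypothetical name. Applying it to a rolling window of values would give a daily series like sharpe_ratio_annual.

```python
import math
import statistics


def annualized_sharpe(values: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a daily value series (risk-free rate assumed 0)."""
    # simple daily returns between consecutive values
    returns = [curr / prev - 1 for prev, curr in zip(values, values[1:])]
    if len(returns) < 2:
        raise ValueError("need at least three values")
    volatility = statistics.stdev(returns)
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return statistics.mean(returns) / volatility * math.sqrt(periods_per_year)


# Illustrative values only, not real market data.
print(round(annualized_sharpe([100.0, 101.0, 100.5, 102.0, 103.0]), 2))
```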

Each stock’s ratio in my portfolio

A pie chart is a good way to see the ratio of each stock in my portfolio. However, I also want to see the share of each asset type and the balance between them. The diversity of asset types (e.g., stock, bond, commodity) is vital for our portfolio management. Therefore, I will use a sunburst chart this time.

# you can run this code in Jupyter Notebook
# ratio: dataframe contains tickers and those ratio
# type: asset type like 'stock' or 'bond'
# ticker: asset name like 'VOO' or 'BTC-USD'
# ratio_percent: each ticker's ratio [percent]

fig = px.sunburst(
  ratio,
  path=["type", "ticker"],
  values="ratio_percent",
  title="Portfolio Recent Value Ratio"
)
fig.show()

the plot ratio of each asset in my portfolio

Now we can see the ratio of each asset and of each asset type.
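
The inner ring of the sunburst is simply the sum of the ticker ratios within each asset type. The sketch below shows that aggregation in plain Python, assuming rows shaped like the ratio dataframe; ratio_by_type is a hypothetical helper, and the tickers and percentages are illustrative only.

```python
from collections import defaultdict


def ratio_by_type(rows: list[dict]) -> dict[str, float]:
    """Sum ratio_percent per asset type (the inner ring of the sunburst)."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["type"]] += row["ratio_percent"]
    return dict(totals)


# Illustrative rows mirroring the columns of the ratio dataframe.
rows = [
    {"type": "stock", "ticker": "VOO", "ratio_percent": 50.0},
    {"type": "bond", "ticker": "BND", "ratio_percent": 30.0},
    {"type": "crypto", "ticker": "BTC-USD", "ratio_percent": 20.0},
]
print(ratio_by_type(rows))
```

Plotly does this summation for you when you pass path=["type", "ticker"], so the helper is only here to make the chart easier to read.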

Organize Dashboard

Each graph is prepared now, so let’s place them on the dashboard. I will use Streamlit to create a dashboard. This dashboard consists of the following functions.

Load initial data
Process data
Visualize graphs
Arrange components

Load initial data

Load our data generated in the processing part as initial data.

@st.cache preserves the result of the function so that the data doesn’t have to be loaded every time. Streamlit re-executes the whole program on every interaction, so we should use features like this to reduce the cost of a re-run.

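import pandas as pd
import streamlit as st

# cache the result so the files are not reloaded on every re-run
# (newer Streamlit releases replace st.cache with st.cache_data)
@st.cache
def read_stock_data_from_local():
  stocks = pd.read_pickle("data/stocks.pkl")
  ratio = pd.read_pickle("data/ratio.pkl")
  sharpe = pd.read_pickle("data/sharpe.pkl")
  return sharpe, stocks, ratio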
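
If caching is new to you, the idea can be tried without Streamlit. The sketch below uses functools.lru_cache from the standard library as a stand-in; it is not how @st.cache is implemented, only an illustration that the function body runs once per distinct argument. load_data and load_calls are hypothetical names.

```python
import functools

load_calls = []  # records every real (non-cached) load


@functools.lru_cache(maxsize=None)
def load_data(path: str) -> str:
    """Stand-in for an expensive load; the body runs only on a cache miss."""
    load_calls.append(path)
    return f"contents of {path}"


load_data("data/stocks.pkl")
load_data("data/stocks.pkl")  # same argument: served from the cache
print(len(load_calls))  # 1
```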

Process data

Processing data is needed when we push the calculate button. Time-series data of the portfolio and its Sharpe ratio and the ratio of each asset will be calculated.

This component’s code is complicated, so I won’t show the whole program. Here is the primary process of this function.

# num_holds(dict): assets and each number of holds
# stocks(dataframe): daily close value of each asset
# ratio(dataframe): each asset ratio in portfolio at the most recent date
# sharpe(dataframe): daily sharpe ratio
# portfolio(dict): detail of each asset

def calc_stock(num_holds, stocks, ratio, portfolio):
  # calc portfolio value
  stocks["Close_Portfolio"] = stocks.apply(
    lambda x: calc_portfolio(x, num_holds), axis=1)
  # ~~~ (omitted)

  # calc sharpe ratio
  sharpe = stocks.loc[:, ["Date", "Close_Portfolio"]]
  # ~~~ (omitted)

  # calc recent value ratio
  ratio = pd.DataFrame(data={"ticker": portfolio["ticker"].keys(), "ratio_percent": recent_values})
  # ~~~ (omitted)

  return sharpe, stocks, ratio

Visualize graphs

Visualize three graphs: the portfolio, the Sharpe ratio, and the ratio of assets.

Streamlit can display charts generated by Plotly just by calling the st.plotly_chart function.

# plot sharpe ratio
fig = px.line(sharpe, x="Date", y="sharpe_ratio_annual")
fig.add_hline(1, line_color="red")
fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False)
st.plotly_chart(fig, use_container_width=True)

Arrange components

Place each component (e.g., title, button, input form, and graphs) using Streamlit.

Arranging components is intuitive, so you can place each part easily. If you want to see examples, visit https://streamlit.io/gallery?category=science-technology.

One tip for making the most of Streamlit is to use st.session_state to preserve variables.

For instance, you can save the current window size if you want to see a stock line chart with multiple window sizes. Streamlit always re-runs everything, so you cannot keep your state unless you store it in st.session_state.

You can write code like this in that case.

# place four buttons with different window sizes
# if you push the "Year" button, the window size is saved as 360
# each button needs its own key; reusing one key raises a duplicate widget error

if st.button("3Year", key="portfolio_3year"):
  st.session_state.window_size = 1080
if st.button("Year", key="portfolio_year"):
  st.session_state.window_size = 360
if st.button("Quarter", key="portfolio_quarter"):
  st.session_state.window_size = 90
if st.button("Month", key="portfolio_month"):
  st.session_state.window_size = 30

The result looks like this.

Save window size as session state
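
Once the window size is stored, it can be used to keep only the most recent rows before plotting. The sketch below shows that step with a plain list; with a pandas dataframe the equivalent would be stocks.tail(window_size). How the real app applies the value is an assumption here, and select_window and WINDOW_SIZES are hypothetical names.

```python
def select_window(rows: list, window_size: int) -> list:
    """Return the last window_size rows (all rows if fewer are available)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return rows[-window_size:]


# Same values as the buttons above.
WINDOW_SIZES = {"3Year": 1080, "Year": 360, "Quarter": 90, "Month": 30}

daily_values = list(range(1, 401))  # stand-in for 400 daily rows
print(len(select_window(daily_values, WINDOW_SIZES["Year"])))   # 360
print(len(select_window(daily_values, WINDOW_SIZES["3Year"])))  # 400
```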

Deploy Streamlit Cloud

Finally, deploy our program to a cloud service to share the app with people. I will use Streamlit Cloud.

All you have to do is select your repository and configure some settings after signing up.

Configure deploy settings

Streamlit Cloud will install the libraries automatically and deploy our app immediately (if you use Poetry or another package manager).

After a while, our app will be deployed like this.

https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/

You can choose the privacy of your app in these settings.

Manage option

Conclusion

In this article, I explained how to process the stock data, visualize it, organize the dashboard, and deploy it to the cloud. I hope this article helps you.

Python Development Setup for Data Scientists in 2022

Ryo Koyajima / 小矢島諒 — Sat, 18 Jun 2022 08:30:32 GMT

Photo by ian dooley on Unsplash

Many useful tools and libraries have appeared in recent years. Some are not well known among data scientists, even though engineers use them often. Thus, I want to introduce some of them to data scientists who are new to Python or software development. In this article, I will show my favorite Python development tools for data science.

This article is intended for data scientists who want to …

use both Mac and Windows (WSL)
deploy code to cloud services like Google Cloud Run
handle several projects simultaneously
manage environment settings with Git

Table of Contents

Visual Studio Code(vscode); free and useful editor
Peacock; color schema manager [Recommended]
Rainbow CSV; coloring CSV file
autoDocstring; document generator
pyenv; version manager
Poetry; powerful package manager [Recommended]
Black, Flake8, isort, and Mypy; formatter and linter

Visual Studio Code(vscode); free and useful editor

https://code.visualstudio.com/

Visual Studio Code (vscode) is one of the most famous editors.
Vscode also suits data scientists because we can work with Jupyter Notebooks as well as Python files in it. You don't have to code in a browser anymore.

Jupyter Notebook in vscode (Image by author)

Peacock; color schema manager

Peacock - Visual Studio Marketplace

Peacock is one of my favorite extensions in vscode.
You can change the color schema with Peacock using the following steps.

"Ctrl(Command) + Shift + P" in vscode
type "Peacock: Change to a Favorite Color"
select your favorite one

Of course, you can set up your color schema by typing "Peacock: Enter a Color" and inputting the hex code.

Select your favorite color (Image by author)

Advantages for data scientists:
When you work on several projects simultaneously, Peacock is quite helpful.
You can tell projects apart by their appearance, which prevents mix-ups.
In addition, you can manage the color schema with Git, so you can use the same color on different computers.

You can distinguish the project you want to work on (Image by author)

You can control the color by Git (Image by author)

Rainbow CSV; coloring CSV file

https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv

If you are a data scientist, you often need to look at CSV files. Rainbow CSV colorizes each column of your CSVs. Excel is a good tool for viewing CSV files, but it takes a long time to open them. Try this extension if you want to check a CSV at a glance.

Colorizing dataset (Image by author)

autoDocstring; document generator

https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring

autoDocstring is a document generator that helps you to write maintainable code. Once you define the arguments and return values in your method, this extension generates the document template.

type double quotation three times, then the document will be generated (Image by author)

pyenv; version manager

https://github.com/pyenv/pyenv

pyenv is a famous version manager for Python. To install it on Mac, you can use the brew install pyenv command. If you are a Windows (WSL) user, try the following commands.

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc

Then install a specific version(e.g., 3.9.11) of Python.

pyenv install 3.9.11

I recommend pinning the version in the working directory with this command.

pyenv local 3.9.11

The command generates a .python-version file, so you can manage the Python version with Git.

pyenv generates version file (Image by author)
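
If you want to confirm that the running interpreter matches the pinned version, a small check like the following can help. It is only a sketch: matches_pinned_version is a hypothetical helper, and it assumes the first line of the .python-version file holds a full version such as 3.9.11.

```python
import sys
from pathlib import Path


def matches_pinned_version(version_file: str, current: tuple = sys.version_info[:3]) -> bool:
    """Compare the interpreter version with the first line of a .python-version file."""
    pinned = Path(version_file).read_text().splitlines()[0].strip()
    return pinned == ".".join(str(part) for part in current)


if __name__ == "__main__":
    if Path(".python-version").exists():
        print(matches_pinned_version(".python-version"))
```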

Poetry; a powerful package manager

GitHub - python-poetry/poetry: Python packaging and dependency management made easy

Poetry is a Python package manager that resolves dependencies between libraries. Compared to pip, Poetry manages libraries more smartly. It keeps two lists of libraries: one (pyproject.toml) contains the libraries you want to install, and the other (poetry.lock) contains every library that the former depends on. (This is similar to how npm works in JavaScript.)

For instance, if you install pandas with Poetry, pandas is defined in the former file, and all the packages it depends on are described in the latter.

Former defines only pandas and Python itself (Image by author)

Latter describes all the packages that are used by pandas (Image by author)

These files are automatically updated when you install new packages. You don't need to run the pip freeze command anymore.

Moreover, Poetry can generate a virtual environment so that you can execute Python in an isolated environment. Therefore, you don't need to worry about unintended dependencies.

Here is a quick start to Poetry.

$ pip install poetry # install Poetry
$ poetry config virtualenvs.in-project true --local # generate venv in working directory
$ poetry init # initial settings of Poetry
$ poetry add pandas # install package e.g. pandas
$ poetry shell # launch virtual environment

If you've installed Poetry, don't forget to set Poetry's virtual environment as the default interpreter in vscode.

select poetry virtual environment (Image by author)

Once you’ve set up Poetry and manage pyproject.toml, poetry.lock, and poetry.toml with Git, you can share the same environment with your teammates.

Black, Flake8, isort, and Mypy; formatter and linter

These packages speed up your coding and keep your programs neat.

They are only used in a development environment, so you can install them with the -D option.

poetry add -D black flake8 isort mypy

Then modify vscode settings via settings.json. You can enable the above linters and formatters explicitly.

"python.formatting.provider": "black",
"python.linting.flake8Enabled": true,
"[python]": {
  "editor.codeActionsOnSave": {
    "source.organizeImports": true
  }
},
"python.linting.mypyEnabled": true,

Conclusion

I've introduced several valuable tools for data scientists to set up a Python environment. I uploaded sources in this repository(https://github.com/koyaaarr/python-setup).

I hope this article is helpful to you.

Python Development Setup for Data Scientists in 2022 was originally published in CodeX on Medium, where people are continuing the conversation by highlighting and responding to this story.

dbt and BigQuery in Practice; A Use-Case of Transforming Stock Data

Ryo Koyajima / 小矢島諒 — Thu, 02 Jun 2022 10:44:46 GMT

Updated 2022/7/22: updated the data pipeline as follows:

created an additional warehouse to store the calculated portfolio value
-> this isolates each data mart so that it is not affected by changes in the other marts

Updated 2022/6/11: pushed source code to GitHub: https://github.com/koyaaarr/invest-analytics-model

Introduction

This article explains how to use dbt and BigQuery to transform actual data.

You can easily create a data lake, data warehouse, and data mart using dbt. It also enables us to test our data quality. I will combine BigQuery with dbt to transform actual stock data into a data mart used by my dashboard.

This article relates to the following one, so please read it if you have some time.

https://koyaaarr.medium.com/a-practical-use-case-of-cloud-native-and-secured-dashboard-with-google-cloud-and-python-streamlit-a66e60d62ca8

Then, let's get started.

Modeling

Before getting into the transformation, we need to define the data schemas of each table.

Here is the image of the tables we need.

data pipeline

On the data mart side, I want to see the overall performance of my portfolio and the ratio of each stock within it. Therefore, I need to create two data marts for these purposes.

On the other hand, the data for each stock (VOO, BTC-USD, BND) is stored in Google Cloud Storage. The files are in CSV format and contain dates and values such as the closing price.

Source Stock Data

Therefore, I need to aggregate those data sources into the data warehouse and transform them into each data mart.

Each data schema is described in the following section.
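
To make the aggregation step concrete, here is a small Python sketch of what the warehouse does conceptually: it combines the per-ticker CSV files into one row per date. In the actual pipeline this is done in SQL with dbt, not in Python. The column names Date and Close, the sample prices, and the helper name merge_closes are assumptions for illustration.

```python
import csv
import io


def merge_closes(csv_by_ticker: dict[str, str]) -> list[dict]:
    """Combine per-ticker CSV text into one row per date with a close_<ticker> column."""
    merged: dict[str, dict] = {}
    for ticker, text in csv_by_ticker.items():
        for record in csv.DictReader(io.StringIO(text)):
            row = merged.setdefault(record["Date"], {"Date": record["Date"]})
            row[f"close_{ticker}"] = float(record["Close"])
    # ISO dates sort correctly as plain strings
    return [merged[date] for date in sorted(merged)]


# Illustrative data only, not real prices.
sources = {
    "voo": "Date,Close\n2022-06-01,375.0\n2022-06-02,382.5\n",
    "bnd": "Date,Close\n2022-06-01,75.5\n2022-06-02,75.9\n",
}
for row in merge_closes(sources):
    print(row)
```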

Introducing dbt and BigQuery

Here are the prerequisites of this use case. I will use the dbt CLI, installed with Python.

Python: 3.9.11
dbt-core: 1.1.0
dbt-bigquery: 1.1.0

First of all, you can initialize dbt with the following command.

dbt init

This command creates a lot of files and directories.

Then you can create "profiles.yml" in the same directory as "dbt_project.yml". This file is generated in "~/.dbt" by default, but I recommend creating it in your working directory so that you can manage it with Git.

In the beginning, you will edit "models/", "dbt_project.yml", and "profiles.yml".

Let's take a look at each file.

"dbt_project.yml" defines the configuration of the project. You will edit the bottom of this file, which lists the tables we create and lets you specify each table's materialization type.

name: 'invest_analytics'
version: '1.0.0'
config-version: 2

~~~

models:
  invest_analytics:
    invest_analytics_dev:
      +materialized: view
      warehouse-date:
      warehouse-stock:
      warehouse-num-hold:
      warehouse-portfolio:
      mart-portfolio-value:
        +materialized: table
      mart-portfolio-ratio:
        +materialized: table

"profiles.yml" defines system configuration, including connection with BigQuery. If you authenticate using a service account, you need to designate the key file.

invest_analytics:
  outputs:
    dev:
      dataset: invest_analytics_dev
      job_execution_timeout_seconds: 300
      job_retries: 1
      keyfile: ../service_account.json
      location: asia-northeast1
      method: service-account
      priority: interactive
      project: invest-analytics-347211
      threads: 1
      type: bigquery
  target: dev

The "models" directory contains SQL files and "schema.yaml".

You can write standard SQL in dbt; the only difference is how you reference source tables.

Instead of the ordinary table name, you reference a table with dbt's ref() syntax like this.

select
    Date
  , cast(close_voo as integer) as close_voo
  , cast(close_btcusd as integer) as close_btcusd
  , cast(close_bnd as integer) as close_bnd
  , cast(close_total as integer) as close_total
from
  {{ ref('warehouse-stock') }} as st
  left outer join {{ ref('warehouse-portfolio') }} as pf 
    on st.Date = pf.Date
order by
  Date
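
The reason for writing ref() instead of a plain table name is that dbt can then work out which model depends on which and build them in the right order. The Python sketch below imitates that idea with a regular expression and a topological sort. It is not how dbt is implemented (dbt renders the Jinja templates itself); the model SQL strings are simplified stand-ins, and build_order is a hypothetical helper.

```python
import re
from graphlib import TopologicalSorter

# matches {{ ref('model-name') }} with single quotes only (a simplification)
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def build_order(models: dict[str, str]) -> list[str]:
    """Order models so that every ref() target is built before the model using it."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Simplified stand-ins for the SQL files in the models directory.
models = {
    "warehouse-stock": "select * from raw_stock",
    "warehouse-portfolio": "select * from {{ ref('warehouse-stock') }}",
    "mart-portfolio-value": (
        "select * from {{ ref('warehouse-stock') }} as st "
        "left outer join {{ ref('warehouse-portfolio') }} as pf on st.Date = pf.Date"
    ),
}
print(build_order(models))
```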

If your table is generated by source data like CSV, you can define source data like this.

select
    max(case ticker when 'VOO' then num_of_hold else null end) as num_voo
  , max(case ticker when 'BTC-USD' then num_of_hold else null end) as num_btcusd
  , max(case ticker when 'BND' then num_of_hold else null end) as num_bnd
from
  {{ source('invest_analytics_dev', 'source-portfolio') }}

Finally, you need to define the data schemas in "schema.yaml" like this.

version: 2
sources:
  - name: invest_analytics_dev
    tables:
      - name: source-voo
      - name: source-btcusd
      - name: source-bnd
      - name: source-portfolio

models:
  - name: mart-portfolio-ratio
    description: ''
    columns:
      - name: ticker
        description: ''
        tests:
          - unique
          - not_null
          - accepted_values:
              values: ['voo', 'btcusd', 'bnd']
      - name: close_percent
        description: ''
        tests:
          - unique
          - not_null

If you have data sources imported from CSV files, you can write them in the "sources" part.

Then you can add your tables' data schema. In addition, you can define tests for each column. This example contains the "uniqueness test", "not null test", and "accepted values test".
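
To show what these three tests check, here is a rough Python equivalent that runs on a list of rows. dbt itself runs the tests as SQL queries against BigQuery, so this is only an illustration. In this sketch, nulls are skipped by the unique and accepted-values checks and left to the not-null check, which is an assumption about the intended semantics; the function names and sample rows are hypothetical.

```python
def check_unique(rows, column):
    """Pass when no non-null value appears more than once."""
    values = [row[column] for row in rows if row[column] is not None]
    return len(values) == len(set(values))


def check_not_null(rows, column):
    """Pass when no row has a null (None) in the column."""
    return all(row[column] is not None for row in rows)


def check_accepted_values(rows, column, accepted):
    """Pass when every non-null value is in the accepted list."""
    return all(row[column] in accepted for row in rows if row[column] is not None)


# Illustrative rows shaped like the mart-portfolio-ratio table.
rows = [
    {"ticker": "voo", "close_percent": 55.0},
    {"ticker": "btcusd", "close_percent": 15.0},
    {"ticker": "bnd", "close_percent": 30.0},
]
print(check_unique(rows, "ticker"))
print(check_not_null(rows, "close_percent"))
print(check_accepted_values(rows, "ticker", ["voo", "btcusd", "bnd"]))
```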

Once you've finished defining each file, let's generate the tables with this command.

dbt run --full-refresh --profiles-dir .

Then you get the result like this.

12:11:02  Running with dbt=1.1.0
12:11:02  Unable to do partial parsing because a project config has changed
12:11:03  Found 5 models, 17 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 4 sources, 0 exposures, 0 metrics
12:11:04  Concurrency: 1 threads (target='dev')
12:11:04  1 of 5 START view model invest_analytics_dev.warehouse-date .................... [RUN]
12:11:06  1 of 5 OK created view model invest_analytics_dev.warehouse-date ............... [OK in 1.61s]

~~~

12:11:11  5 of 5 START table model invest_analytics_dev.mart-portfolio-ratio ............. [RUN]
12:11:14  5 of 5 OK created table model invest_analytics_dev.mart-portfolio-ratio ........ [CREATE TABLE (3.0 rows, 62.7 KB processed) in 3.27s]
12:11:14  Finished running 3 view models, 2 table models in 11.36s.
12:11:14  Completed successfully
12:11:14  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

You can see each table is created in Google Cloud Console.

BigQuery console

If you want to check the quality of the data, run this command.

dbt test --profiles-dir .

Then, you get the result like this.

10:04:03  Running with dbt=1.1.0
10:04:03  Found 5 models, 17 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 4 sources, 0 exposures, 0 metrics
10:04:04  Concurrency: 1 threads (target='dev')
10:04:04  1 of 17 START test accepted_values_mart-portfolio-ratio_ticker__voo__btcusd__bnd  [RUN]
10:04:06  1 of 17 PASS accepted_values_mart-portfolio-ratio_ticker__voo__btcusd__bnd ..... [PASS in 2.16s]

~~~

10:04:31  17 of 17 START test unique_warehouse-stock_Date ................................ [RUN]
10:04:32  17 of 17 PASS unique_warehouse-stock_Date ...................................... [PASS in 1.33s]
10:04:32  Finished running 17 tests in 28.88s.
10:04:32  Completed successfully
10:04:32  Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17

Lastly, let's generate the documentation for our tables with the following commands.

dbt docs generate --profiles-dir .
dbt docs serve --profiles-dir .

Then you can see the table definitions and lineage.

table definition

lineage graph

Conclusion

I hope you can find this article helpful. I explained the actual use case of dbt and BigQuery with stock data. You can create tables according to their dependencies, test data quality, and even generate the definition documents.

dbt and BigQuery in Practice; A Use-Case of Transforming Stock Data was originally published in Dev Genius on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Practical Use-Case of Cloud-Native and Secured Dashboard with Google Cloud and Python Streamlit

Ryo Koyajima / 小矢島諒 — Wed, 25 May 2022 12:13:59 GMT

Demo

Introduction

With rising cloud services and data scientist-friendly visualization tools, building a dashboard is getting easier and faster.

However, it’s also becoming more and more complicated to understand or utilize them.

This article will show the use-case of combining these technologies by building a secured dashboard managing my investment portfolio.

This article explains the application from three perspectives: business, data science, and engineering. These are often defined as essential skills in data science. Therefore, I break down my explanation into these sections so you can read the ones you’re interested in.

Data Science Skill’s Venn Diagram

Business Perspective: Requirements

Though this article focuses on technology, it wouldn’t be convincing if my app were impractical (even if it is only for personal use).

Therefore, I will define some requirements before the implementation.

By the way, I buy some ETFs monthly, but I’m not sure what’s going on in my portfolio. This is because ETF prices vary and go up and down day by day. In addition, I don’t check my portfolio frequently because I don’t want to spend much time watching the stock markets. These things motivated me to create an app satisfying the following requirements.

1. show specific ETFs I’m interested in to see whether each one is a bargain or not
2. show the current value of my portfolio to check how well or badly it is doing
3. show the ratio of the types of ETFs (e.g., stock/bond/commodity) to help me decide whether I need to rebalance my portfolio toward the best ratio of asset types
4. update daily because I’ll check this daily at most
5. require authentication to hide my assets (this is IMPORTANT!)

What I want to see

In addition to the above requirements, the UI should be handy but provide sufficient information. Something between the iPhone Stocks app and TradingView is ideal for me.

Target Position

Data Science Perspective: Data Modeling and Build Data Pipeline

I need to prepare a data mart for my dashboard to meet the above requirements. The data mart is one of the concepts in the data model, and this also includes the data warehouse and the data lake. These concepts have different purposes so let me explain them in the following table.

Data Model

There are two types of visualization needed, so I will create two data marts and a data warehouse that can provide enough data for data marts.

Data Pipeline

Now let’s get into the data schema of data marts. The first data mart is to plot a line chart of my portfolio and stocks, so historical values need to be prepared. The second one is to plot a pie chart of the ratio of my portfolio, so each stock’s ratio needs to be calculated.

Calculate Portfolio Value

https://medium.com/media/730d364002a80ba278eba29837ab0004/href

Calculate Portfolio Ratio

https://medium.com/media/3993d05909ff9416b0f3c342471321a0/href
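The embedded snippets are not reproduced here, so as a rough idea of the two calculations, here is a minimal pandas sketch. The prices and holdings are invented, and the variable names are assumptions for illustration; the real pipeline may structure this differently.

```python
import pandas as pd

# Made-up closing prices: one row per date, one column per ticker
prices = pd.DataFrame(
    {"VOO": [400.0, 410.0], "BND": [75.0, 74.0]},
    index=pd.to_datetime(["2022-05-23", "2022-05-24"]),
)
# Made-up number of units held per ticker
holdings = pd.Series({"VOO": 10, "BND": 20})

# Data mart 1: historical value per ticker and for the whole portfolio
value = prices * holdings  # the Series index aligns with the column names
value["total"] = value.sum(axis=1)

# Data mart 2: each ticker's share of the latest total, in percent
latest = value.iloc[-1].drop("total")
ratio = (latest / latest.sum() * 100).round(1)

print(value)
print(ratio)
```

The first table feeds the line chart and the second feeds the pie chart.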

The data modeling in detail is omitted due to space limitations. In the next article, I will introduce Google BigQuery and dbt in this data pipeline to explain modeling.

Ref: https://koyaaarr.medium.com/dbt-and-bigquery-in-practice-transform-stock-data-1771e2393319

Engineering Perspective: Architecture and Software

Finally, I select appropriate software and services and combine them to realize my system. Here is the whole architecture.

Architecture

Let me explain each component for each role.

Data Retrieve, Transform, Accumulate Script
- Cloud Function: for data retrieving, transforming, and accumulating
- Cloud Storage: data will be served from here via API
- pandas-datareader: to get stock data
- gcsfs: to get data from Cloud Storage

Data Visualize Application
- Cloud Run: run application containerized with Docker.
- Cloud IAP(Identity-Aware Proxy): add authentication to Cloud Run app without coding
- Streamlit: serve a dashboard quickly and nicely
- Plotly: plot graphs quickly and nicely

Operation, CI/CD
- Cloud Build: Connect with GitHub to automatically and immediately deploy to Cloud Run / Cloud Functions after git push
- Cloud Scheduler: trigger Cloud Function regularly
- Cloud Pub/Sub: used together with Cloud Scheduler for the same purpose

How to use it regularly

I use a YAML file to simplify managing my portfolio. It configures my portfolio: the number of stocks I hold and the details of each stock.

https://medium.com/media/32dd9bb9503fdab8ee19d1e6112535b6/href

All I need to do is modify the number of stocks I hold in this YAML file when I buy some stocks. After git push, Cloud Build detects the change and copies the YAML file to Cloud Storage automatically; then Cloud Function recalculates according to the data so that Cloud Run can fetch the latest data from there.

Operation

Lastly, what my dashboard looks like is this.

Authentication is required

Dashboard Overview

Conclusion

I broke down my application into three perspectives. There are a lot of valuable services like Cloud Run and Cloud IAP. These look complicated to use but are quite helpful in building an application quickly, so I strongly recommend diving into them. This article explained how to create the dashboard using Google Cloud and Python Streamlit. I hope you will find this helpful.

Reference

- Data Science Venn diagram (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
- Trading View (https://www.tradingview.com/)
- iPhone Stocks app (https://apps.apple.com/us/app/stocks/id1069512882)
- Enabling IAP with Cloud Run (https://cloud.google.com/iap/docs/enabling-cloud-run)

I, as a data scientist, will show you why Jupyter Notebook and Jupyter Lab are good for data…

Ryo Koyajima / 小矢島諒 — Thu, 20 May 2021 20:40:27 GMT

I, as a data scientist, will show you why Jupyter Notebook and Jupyter Lab are good for data analysis

This is how data is visualized using Jupyter Lab in the demo in this article

For those who want to get started with data analysis in Python

This article will introduce Jupyter Notebook and Jupyter Lab (collectively called Jupyter), very reliable tools for data analysis in Python.

Jupyter is already in common use in the data science world, but I would like to show its benefits with a demo.

Assumptions

In this article, I analyze data under the following conditions.

Analyze table data, not unstructured data such as images and texts
Analyze data of several GB or tens of thousands of records, rather than data of several TB or hundreds of millions of records
Do the exploratory analysis, rather than routine analysis

What is not written

The following items are not covered in this article. If you want to use Jupyter after reading this article, please refer to other websites or books.

How to build Jupyter Notebook and Jupyter Lab environment
Basic operations of Jupyter Notebook and Jupyter Lab
How to use pandas data structures and methods

What a time-consuming process data analysis is!

Exploratory data analysis is time-consuming. I think this is because it requires thousands of rounds of trial and error. In conventional development, trial and error often means fixing a bug in the code or modifying an algorithm. However, there is much more trial and error from the data perspective when it comes to data analysis. You have to look at the data from various angles, verify the quality of the data, and even modify the code when you realize that the data definition you heard from the business department is different…

Therefore, in exploratory data analysis, it is important to be able to do trial and error as quickly as possible.

You also need to report the results of the exploratory data analysis to your boss or clients. Because of the nature of reporting analysis results, the report (PowerPoint, etc.) will contain many tables and graphs, which is an unexpectedly difficult and time-consuming task.

So, it is also important to be able to prepare tables and graphs quickly.

Two benefits of Jupyter

Time consumption is the bane of exploratory data analysis, but Jupyter can alleviate it. For example, it has the following advantages.

Faster trial and error iteration
- You can get execution results for each row (each cell)
- Variables are saved while Jupyter is running so that you can use them multiple times
Easy to see the execution results
- Tabular data is easy to read
- Graphs are printed right below the code

I would like to demonstrate these benefits with a demo.

Demo with rental apartments data

I will use rental apartment data in Chuo-ku, Tokyo that I got from SUUMO(a Japanese rental apartments website) to demonstrate the advantages of Jupyter(*1). I like Jupyter Lab, so I will use it for this demo.

The purpose of the data exploration is to visualize the distribution of the rents of rental apartments.

First, we need to import pandas and load the data. If a character encoding error occurs due to Japanese or Windows characters, pass encoding='CP932' as an argument.

# Load the library
import pandas as pd

# Read in the data
apart = pd.read_csv('apartments_20210410_chuo.csv')
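If you do not know the encoding in advance, one option is to try UTF-8 first and fall back to CP932. The following is a minimal sketch; `read_csv_with_fallback` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp932")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding
            last_error = err
    raise last_error
```

Calling `read_csv_with_fallback('apartments_20210410_chuo.csv')` would then behave like the plain `pd.read_csv` call above when the file is UTF-8, and retry with CP932 otherwise.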

Once the data has been read, use the head() method to display and check the data. This head() method is so good that you can see the tabular data very easily and clearly. In my opinion, it is possible to use this table’s screenshot for reports. (Of course, it depends on who you are reporting to. If you are working with an external client, it is better to export to a CSV file and use a PowerPoint table.)

# Display the data
apart.head()

Image by author

The default output is 5 lines, but you can change the number of output lines by passing a number as an argument. In my usage, I use 5 lines (default) when I want to see the columns and values of the data, 1 line when I want to save the data to see later, and 100 lines when I want to see the data itself.

The purpose of this demo is to visualize the rent. The rent is in the form of “10 万円” which contains kanji so we need to omit these characters and convert them into an int type number.

We will combine the lambda expression with the map function to fix the rent column. One of the advantages of Jupyter is that you can iterate like this, thinking and executing processes on the fly. (This may be a good point about the interactive environment rather than Jupyter…)

# Erase '円'
# If there is '万', remove it and multiply by 10000
apart['rent_yen'] = list(map(lambda x: x.replace('円', ''), apart.rent))
apart['rent_yen'] = list(map(lambda x: float(x.replace('万', ''))*10000 if '万' in x else x, apart.rent_yen))
apart['rent_yen'] = apart['rent_yen'].astype('int')

There is a function called apply() in pandas that can do the same thing as map(), but I recommend using map() for its speed. However, map() can only process one column of the DataFrame at a time, so if you need to process values from multiple columns in one row at the same time, use apply().
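The difference between the two can be seen in a small sketch. It uses `Series.map`, the pandas counterpart of the built-in `map()` used above, and invented data; the column names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"rent": ["80000円", "95000円"], "fee": ["5000円", "0円"]})

# Series.map: element-wise over a single column
df["rent_yen"] = df["rent"].map(lambda x: int(x.replace("円", "")))
df["fee_yen"] = df["fee"].map(lambda x: int(x.replace("円", "")))

# DataFrame.apply with axis=1: the function sees one whole row at a time,
# so it can combine values from several columns
df["monthly"] = df.apply(lambda row: row["rent_yen"] + row["fee_yen"], axis=1)

print(df)
```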

By the way, when you rent an apartment in Japan, you usually sign a two-year contract. Condominium fees and a gratuity are also required. To estimate the whole cost more accurately, let’s calculate the cost over two years. Specifically, we will calculate the sum of two years of rent (24 months), plus two years of condominium fees (24 months), plus the gratuity.

# Erase '円'
# If there is '万', remove it and multiply by 10000
apart['condo_fee_yen'] = list(map(lambda x: x.replace('円', ''), apart.condo_fee))
apart['condo_fee_yen'] = list(map(lambda x: float(x.replace('万', ''))*10000 if '万' in x else x, apart.condo_fee_yen))
apart['condo_fee_yen'] = apart['condo_fee_yen'].astype('int')

We try to convert the condominium fee into yen in the same way as the rent.

But when we apply the same processing, we get an error. It seems that a value could not be converted to a numeric type because it contained a hyphen.

Image by author

Checking the data, we see that a hyphen indicates that there is no condominium fee.

Image by author

Even if errors occur, Jupyter itself is still running. Thus the variables and libraries that have been calculated and loaded are still alive, so we can try again. This is another advantage of Jupyter.

We’ll create a function to handle hyphens, but it’s a bit too complicated to write as a lambda expression, so we’ll write it as a regular function.

def extract_jpy(x):
  """
  - Erase '円'
  - hyphen is replaced by 0
  - If there is '万', remove it and multiply by 10000
  - convert into integer
  """
  x = x.replace('円', '')
  x = x.replace('-', '0')
  if '万' in x:
    x = x.replace('万', '')
    # round() guards against floating-point error before the int conversion
    x = round(float(x) * 10000)
  return int(x)

apart['condo_fee_yen'] = list(map(extract_jpy, apart.condo_fee))

It looks like we have successfully converted the condominium fee into a number.
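For larger data, the same rules can also be expressed with pandas string methods instead of a Python-level function. This is a sketch of an alternative, not the code used in this demo; `extract_jpy_series` is a hypothetical name, and unlike the function above it treats only a lone "-" as zero rather than replacing every hyphen.

```python
import pandas as pd

def extract_jpy_series(col: pd.Series) -> pd.Series:
    """Vectorized take on the same rules: '-' means 0, '万' means x10000."""
    # Drop the yen sign, then turn cells that are exactly "-" into "0"
    cleaned = col.str.replace("円", "", regex=False).replace("-", "0")
    has_man = cleaned.str.contains("万", regex=False)
    number = cleaned.str.replace("万", "", regex=False).astype(float)
    multiplier = has_man.map({True: 10000, False: 1})
    return (number * multiplier).round().astype(int)

sample = pd.Series(["-", "5000円", "1.2万円"])
converted = extract_jpy_series(sample)
print(converted.tolist())
```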

We can now do the same for the gratuity.

apart['gratuity_yen'] = list(map(extract_jpy, apart.gratuity))

Since we are doing the same processing here as for the condominium costs, we can copy the cells and use them. Jupyter has some useful shortcut keys that can be used for quick operations. In particular, I often use c: copy cell, x: cut cell, v: paste cell, z: undo cell operation, a: add new cell above, b: add new cell below. I also recommend using ESC: switch to cell operation mode and Enter: switch to code input mode, as they will accelerate your work.

We can see the three columns we have created have been handled well.

apart.head()

Image by author

It looks like “rent_yen”, “condo_fee_yen”, and “gratuity_yen” are all well extracted as numerical values.

Now, let’s calculate the whole cost for two years using the pandas apply function.

apart['cost_2years'] = apart.apply(lambda x: (x.rent_yen + x.condo_fee_yen)*24 + x.gratuity_yen, axis=1)

apart.head()

Image by author

It looks like we have successfully calculated the whole cost over two years. We are now ready to visualize the data.
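Because this calculation only adds and multiplies columns, it can also be written as plain column arithmetic, which pandas evaluates for all rows at once. A small sketch with invented numbers:

```python
import pandas as pd

# Invented rows using the same column names as in the demo
apart = pd.DataFrame({
    "rent_yen": [100000, 150000],
    "condo_fee_yen": [5000, 0],
    "gratuity_yen": [100000, 300000],
})

# 24 months of rent and condominium fees, plus the one-time gratuity
apart["cost_2years"] = (
    (apart["rent_yen"] + apart["condo_fee_yen"]) * 24 + apart["gratuity_yen"]
)
print(apart)
```

The result should match the apply() version row by row; apply() remains handy when the per-row logic cannot be expressed as column operations.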

Now we will visualize the data. We will use plotly. This is my favorite library because of its ease of use and beautiful visualization. In particular, the appearance is great so that it can be used for PowerPoint as is. (Unlike seaborn/matplotlib, Japanese is not garbled by default, which is also nice.)

# Load the library
import plotly.express as px

# Visualize
fig = px.histogram(apart, x='cost_2years')
fig.show()

Image by author

The histogram shows a wide distribution, from 200k to 23M. 20M JPY apartments are too expensive for me to live in, so we’ll filter with a threshold of 10M, which covers most of the data.

fig = px.histogram(apart.query('cost_2years <= 10000000'), x='cost_2years')
fig.show()

Image by author

We can see that there are several peaks in this graph. The distribution might be different depending on the room layout, so let’s try to visualize it by color-coding according to the layout.

fig = px.histogram(apart.query('cost_2years <= 10000000'), x='cost_2years', color='layout', barmode='overlay')
fig.show()

Image by author

We can see that the distribution differs depending on the room layout. Now let’s compare the distribution of 1K and 1LDK. Since most of the data is up to 7M JPY, we will filter by 7M.

fig = px.histogram(apart.query('cost_2years <= 7000000 and (layout == "1K" or layout == "1LDK")'), x='cost_2years', color='layout', barmode='overlay')
fig.show()

Image by author

We can now see that the distribution is neatly divided into two peaks. This is the end of the analysis in this demo, but there are many things to explore, such as what contributes to the price distribution besides the room layout.

In this way, Jupyter is efficient when you look at data for the first time and don’t yet know its contents, data types, format, or distribution, or when the work that needs to be done only emerges while exploring the data.
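The visual impression can be cross-checked with numbers. The sketch below uses a handful of invented rows with the same column names and compares the two layouts with a groupby.

```python
import pandas as pd

# Invented rows; the real data comes from the CSV loaded earlier
apart = pd.DataFrame({
    "layout": ["1K", "1K", "1LDK", "1LDK", "2LDK"],
    "cost_2years": [2_400_000, 2_800_000, 4_500_000, 5_100_000, 8_000_000],
})

# Same filter as the last histogram, then one summary row per layout
subset = apart.query('cost_2years <= 7000000 and layout in ["1K", "1LDK"]')
summary = subset.groupby("layout")["cost_2years"].agg(["count", "median"])
print(summary)
```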

Appendix: Issues with Jupyter and how to solve them

Finally, I’d like to list some of my concerns about using Jupyter and how to handle them. The increase of technical debt is a problem not only for Jupyter, but also for machine learning systems, and I think there is still room for improvement.

Appearance
> JupyterLab includes a dark theme out of the box.
Code Completion
> Use a library for completion (such as jupyterlab-lsp)
Increasing technical debt
> Use jupytext to generate .py files and version them with git
> Move code into .py files as needed and import it as functions
> Write documentation
> Write test code

*1: Data collection from websites for data analysis doesn’t violate any laws in Japan unless it relates to personal information or puts a heavy load on the servers.

Between Machine Learning PoC and Production

Ryo Koyajima / 小矢島諒 — Mon, 01 Feb 2021 17:54:02 GMT

the final architecture of this article

The Japanese version is here:
(https://qiita.com/koyaaarr/items/259ad4f0d574497c5b08)

Introduction

Machine learning Proof of Concept (PoC) is very popular these days due to the recent AI boom. Afterward, if (very fortunately) you get good results in the PoC, you may want to put the PoC system into production. However, while a lot of knowledge has been shared about exploratory data analysis and building predictive models, there is still not much knowledge on how to put them into practice, especially in production.

In this article, we will examine what is needed technically during the transition from PoC to production operations. I hope that this article will help you to make your machine learning PoC not only transient but also create value through production.

What is written in this article

How to proceed with data analysis in a PoC
How to proceed with the test operation of the machine learning PoC (the main topic of this article).
Architecture in each phase of PoC and test operation (the main topic of this article).
Additional things to consider for production operations

I will focus especially on test operations. During test operations, operations and analysis are often done in parallel, and I will describe an example of how to update the architecture of the system while balancing operations and analysis.

What is not written in this article

Details on exploratory data analysis
Details on preprocessing and feature engineering
Details on building predictive models
Lower layers than middleware (databases and web servers)
Consulting skills to handle Machine Learning PoC

Consulting skills are very important in machine learning projects because of their uncertainty but are not included in this article as the focus is on the technology.

Systems assumed in this article

Use a relatively small dataset, less than 100 GB
Handle data that can be stored in memory, rather than data in the hundreds of millions of records
Batch learning and batch inference
No online (real-time) learning or inference
System construction proceeds in parallel with data analysis
No concrete requirements at the beginning, so we define them as needed along the way

Data used in this article

We will use data from a previous Kaggle competition, “Home Credit Default Risk” in this article. This competition uses an individual’s credit information to predict whether or not they will default on their debt. There are records for each loan application in the data, and each record contains information about the applicant’s credit and the label indicating whether the person was able to repay the loan or defaulted on it.

In this article, we will assume that we are in the data analytics department of a certain loan lending company. Under this assumption, you want to utilize machine learning to automate credit decisions based on this credit information.

For the sake of explanation, we will divide “application_train.csv” among the data available in this competition as shown in the figure. The split data will be used under the following assumptions.

initial.csv: Past credit information, to be used in PoC
20201001.csv: Credit information for October 2020. In the test operation, this data will be handled as training data together with “initial.csv”.
20201101.csv: Credit information for November 2020. In the test operation, this data is handled together with “initial.csv” as training data.
20201201.csv: Credit information for December 2020. In the test operation, this data is handled together with “initial.csv” as training data.
20210101.csv: Credit information for January 2021. In the test operation, we will start forecasting from this month.

The actual code for splitting the data is shown below.

split_data.ipynb

https://medium.com/media/4672e04f6b37c9cbb32648e60a3a7732/href
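The embedded notebook is not reproduced here. As a rough idea of such a split, here is a minimal sketch that shuffles the rows, keeps a share as the initial set, and divides the rest evenly into monthly chunks. The 60% share, the random shuffle, and the function name are assumptions for illustration and may not match the actual notebook.

```python
import numpy as np
import pandas as pd

def split_by_month(df, names, initial_frac=0.6, seed=0):
    """Shuffle rows, keep `initial_frac` as the initial set, and split the
    remainder evenly into one chunk per name in `names`."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_initial = int(len(shuffled) * initial_frac)
    parts = {"initial": shuffled.iloc[:n_initial]}
    rest = shuffled.iloc[n_initial:]
    chunks = np.array_split(np.arange(len(rest)), len(names))
    for name, idx in zip(names, chunks):
        parts[name] = rest.iloc[idx]
    return parts

# Toy stand-in for application_train.csv
df = pd.DataFrame({"SK_ID_CURR": range(100), "TARGET": [0, 1] * 50})
parts = split_by_month(df, ["20201001", "20201101", "20201201", "20210101"])
for name, part in parts.items():
    print(name, len(part))
```

Each part could then be written out with `part.to_csv(f"{name}.csv", index=False)`.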

Situation to be considered

In this article, for ease of explanation, we will assume the following project. The following story is the author’s imagination based on the data of “Home Credit Default Risk” and has nothing to do with any actual company or business. The author is a complete novice in the field of credit operations, so the scenario may differ greatly from actual operations.

As a data scientist, I am participating in a project to automate the credit approval process at a loan lending company. The credit judgment work is done manually by the screening department, but we are considering whether machine learning can be used to reduce man-hours and improve the accuracy of credit judgment. Sample data has already been provided, and we are in the PoC stage. The sample data is a record of past loan defaults by borrowers. Based on this data, if someone wants to take out a new loan, we would like to be able to predict whether or not that person will default on the loan so that we can decide whether or not to lend the loan.

Scope of the project in this article

A machine learning project usually goes through planning, PoC, test operation, and production operation. In this article, to focus on the technical points, I will describe the scope from PoC to test operation. In particular, I will divide the test operation into three phases, since a lot of functions are required to move to production. Since the author has little experience in production operations, I only mention the points that should be considered for production operations.

Structure assumed in this article

In this article, I will assume a minimal structure, as we are going to start the project small. Specifically, there is a consultant who will communicate with the business department (the credit judgment department) and a data scientist who will perform everything from data analysis to system construction. In reality, there is a manager as a supervisor, but they will not appear in this article. Also, as a stakeholder, there is a person in the business department.

PoC phase

Purpose of this phase

The purpose of this phase is to verify whether it is feasible to automate credit decisions. In this phase, we will examine two main points: one is to validate whether the provided data can be used in production (e.g. whether the data can be used in forecasting and whether the records are independent of each other), and the other is to determine how accurately the defaults can be predicted by machine learning.

Architecture in this phase

In this phase, we will work only with JupyterLab. MLflow is included for storing the machine learning models, but (in my opinion,) it is not necessary at the beginning.

Data validation

If you are a data scientist, you want to start looking at the data right away, but first, you need to validate the data. This is because if the data is flawed, any predictions made using the data will likely be useless. Validation includes two main points. The first is to check, for each record, when the data in each column becomes available. All columns may appear to be available by the time the data is provided to us, but that doesn’t mean they become available at the same time. As the simplest example, the objective variable “whether the debtor has defaulted” will be known later than the other columns. The second point is to check whether there is any relationship between the records. For example, if a person applied for a loan twice and the first record is in the training data while the second is in the test data, information leaks from training to test and the measured accuracy becomes unrealistically optimistic. In such a case, you should make sure that both records fall on the same side, either both in the training data or both in the test data. In addition to these points, it is also important to clarify the definition of the data by interviewing the business department about what each column means and what the unit of the record is (e.g. in this data, is it per person or per loan application?). You may use a spreadsheet to track these checkpoints for each column of the data.
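Keeping related records on the same side of the split can be sketched as follows. This is a minimal illustration that assumes a hypothetical `applicant_id` column identifying the person behind each application; the helper name is made up for this example.

```python
import numpy as np
import pandas as pd

def group_train_test_split(df, group_col, test_frac=0.25, seed=0):
    """Assign whole groups (e.g. applicants) to either train or test."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].unique()
    n_test = max(1, int(round(len(groups) * test_frac)))
    test_groups = rng.choice(groups, size=n_test, replace=False)
    is_test = df[group_col].isin(test_groups)
    return df[~is_test], df[is_test]

apps = pd.DataFrame({
    "applicant_id": [1, 1, 2, 3, 3, 4],  # hypothetical column
    "target": [0, 1, 0, 0, 0, 1],
})
train, test = group_train_test_split(apps, "applicant_id")
print(sorted(train["applicant_id"].unique()), sorted(test["applicant_id"].unique()))
```

Because the split is drawn over applicants rather than rows, no applicant can appear in both sets.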

Exploratory data analysis

Once the data has been validated (or in parallel with the validation), we can use Jupyter Lab to see what columns (features) are present by visualizing the sample data. This process will help you understand the data and do feature engineering and model selection. It is also useful to find problems in the data.

First, for each column, we will check the data type, percentage of missing values, etc.

eda.ipynb

https://medium.com/media/3a30e8fa48960687f41db7ea31b17227/href
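The embedded notebook cell is not reproduced here. A per-column check of this kind could look like the sketch below; `summarize_columns` is a hypothetical helper and the sample values are invented, although the two column names come from the competition data.

```python
import pandas as pd

def summarize_columns(df):
    """Return dtype, missing percentage and number of unique values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

sample = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 1293502.5, None, 312682.5],
    "NAME_INCOME_TYPE": ["Working", "State servant", "Working", None],
})
summary = summarize_columns(sample)
print(summary)
```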

Next, to see the distribution, we will visualize it. If the data type is numeric, we will use a histogram, and if the data type is a string, we will use a bar chart.

eda.ipynb

https://medium.com/media/d79447f5a923d717818031a6bff49270/href

Two of the output graphs will be shown as examples. In fact, we should look at the distributions one by one, but we will skip that for now.

AMT_CREDIT

NAME_INCOME_TYPE

Verification of prediction accuracy

From here, we will actually create the model and verify the prediction accuracy. In this case, we will use the AUC of ROC, which is the same evaluation indicator used in “Home Credit Default Risk”. In reality, we will discuss with the business department and agree in advance on which indicator to use. Before creating a prediction model manually, we will first try to make a quick prediction using PyCaret. This will allow us to compare which features/models are effective and use them as a reference when actually creating the model.

eda.ipynb

https://medium.com/media/7456778d496f89f40c10198278f28d7c/href https://medium.com/media/8957efdfb5d50e3232d8f7af653e2ee4/href

In this article, we will compare the following models provided by PyCaret.

Logistic regression
Decision Trees
Random Forest
SVM
LightGBM

LightGBM seems to be superior when the evaluation metric is AUC. In general, LightGBM seems to be better in both accuracy and execution speed in most cases. By the way, recall is small in all models because of the imbalanced data with few positive examples. Depending on your business goals, you may create a model with a high recall score so that you prevent more bad debts. In this article, we will not do any more detailed modeling and will use LightGBM to build models.

Next, we will create and evaluate a LightGBM model in PyCaret to see which features are effective.

https://medium.com/media/9163ca76116d5e99d346505f21df60eb/href

https://medium.com/media/aa54155d4410a4c1b439cd2cbfdc3cab/href

https://medium.com/media/fb4ea58c279eb14d75d1d97dadedaef7/href

https://medium.com/media/e9b3430d1ebb64edd054998a1fdde301/href

If there are a lot of features, as in the case of this data, reducing the number of features can improve both the accuracy and the stability of the model. A simple way to do that is to calculate the feature importance and exclude the features with low importance. In this case, we will simply use the features with high importance. For the columns that are automatically preprocessed by PyCaret, we will use the original columns.

Now, we will create the prediction model manually.

Preprocessing

For the sake of simplicity, we will only impute the missing values in the preprocessing.

forecast.ipynb

https://medium.com/media/f8ee3555899de8902f083f45c73dd5b4/href
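The embedded notebook cell is not reproduced here. One simple way to fill missing values is sketched below; the choice of the median for numeric columns and a "missing" marker for the others is an assumption for illustration and may differ from the strategy in the notebook.

```python
import pandas as pd

def fill_missing(df):
    """Fill numeric columns with the median and other columns with a marker."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna("missing")
    return filled

sample = pd.DataFrame({
    "AMT_CREDIT": [100.0, None, 300.0],
    "NAME_INCOME_TYPE": ["Working", None, "Pensioner"],
})
filled = fill_missing(sample)
print(filled)
```

In operation, the fill values would normally be computed on the training data and reused at inference time so that both see the same preprocessing.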

Feature Engineering

Feature engineering involves feature selection and creating dummy variables of categorical features.

forecast.ipynb

https://medium.com/media/1fc066cb2e66c48b6b2948495ac6a505/href
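The embedded notebook cell is not reproduced here. Selecting features and creating dummy variables could look like the following sketch; the list of selected columns and the sample values are invented for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working"],
    "UNUSED": [1, 2, 3],
})

# Feature selection: keep only the columns judged important (assumed list)
selected = ["AMT_CREDIT", "NAME_INCOME_TYPE"]

# Dummy variables: one indicator column per category value
features = pd.get_dummies(sample[selected], columns=["NAME_INCOME_TYPE"])
print(features.columns.tolist())
```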

Prediction

Use LightGBM to create a model. Also, use Optuna to tune hyperparameters.

https://medium.com/media/4134ef5807c4f42fcfd526da53fee1fb/href https://medium.com/media/64476a8da29c74b46c2f21ce2ccb22a4/href

https://medium.com/media/e6eec40e65d5a47eeb899dea70607748/href

In this verification of the prediction accuracy, we were able to achieve almost the same accuracy as with PyCaret. In reality, we would conduct a more in-depth analysis based on these results, but we finish the verification of the PoC phase here.

From here on, we will assume that the results of the PoC will be reported to the business department, and this project proceeds through PoC to production. However, the PoC will not suddenly go into production. The PoC system will be gradually brought closer to production through several test operations. Therefore, we will divide the test operation into three phases. In each phase, we will add functions little by little so that the operation will be gradually automated and get closer to the production operation.

Supplement: Machine learning model management

For managing machine learning models, MLflow is useful. It can manage models with each hyperparameter explored by Optuna, which will be useful as the number of model trials increases.

Test Operation

The three phases of test operations

Before we can go from PoC to production, we need to implement some features such as automation of operations. However, it would be difficult in terms of man-hours to implement all the necessary functions right away. (Besides, at this stage, you are probably being asked by the business department to further improve the accuracy.) Therefore, we will divide the necessary functions into three phases and implement them gradually, so that we can expand the functions as we operate. In each phase, we will implement the following functions respectively:

Building data pipelines and semi-automated operations
Implementation of regular operation API
Migration to the cloud and automation of operations

Test Operation Phase 1: Building data pipeline and semi-automated operations

Purpose of this phase

In this phase, we partially automate the system created in the PoC. First, we build a data pipeline by dividing the PoC program into blocks such as feature engineering and prediction. This allows training and inference to be executed in isolation or rerun from the middle of the pipeline. We also introduce Airflow, a workflow engine, to run the blocks in order, both on demand and on a schedule.

Architecture in this phase

In the PoC phase, we used a single Jupyter Notebook for preprocessing, prediction, and so on. From this phase, we introduce two open-source tools to execute multiple notebooks in order. The first is “papermill”, which runs Jupyter Notebooks from the command line with parameters, so that we can make predictions for different months without rewriting the notebooks. The second is “Airflow”, which runs each notebook in order. It provides not only automatic execution but also scheduled execution, success and failure notifications, and other functions useful for operational automation.
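As a minimal sketch of the papermill idea, the following builds the command line for one parameterized notebook run. The parameter name TARGET_MONTH and the output path are assumptions made for illustration, not names taken from this article's notebooks. papermill injects each "-p NAME VALUE" pair into the notebook cell tagged "parameters".

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Build the argument list for one parameterized papermill run."""
    command = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # "-p NAME VALUE" overrides a variable in the cell tagged "parameters"
        command += ["-p", name, str(value)]
    return command


# TARGET_MONTH and the output path are hypothetical names for illustration
command = build_papermill_command(
    "inference.ipynb",
    "logs/inference_2021-01.ipynb",
    {"TARGET_MONTH": "2021-01"},
)
print(shlex.join(command))
# To actually run the notebook: subprocess.run(command, check=True)
```

Keeping the executed copy as a separate output notebook per month leaves a record of each run, which helps when a run has to be investigated later.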

Data pipeline

Divide the program created in the PoC into four blocks: “data accumulation”, “feature engineering”, “learning”, and “inference”. The blocks should be loosely coupled, using data as the interface between them. This limits the impact of changes in program logic. For reference, here is an image of the data pipeline in this article. Each block receives the month of execution as a parameter from papermill at the beginning of the program, so that it can be run for a specific month.
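To illustrate what "data as an interface" can look like, here is a toy sketch with the four blocks as plain functions. It is not the code of the notebooks in this article: the file names, the columns x and y, and the one-parameter model are all assumptions, and for brevity it trains and infers on the same month. The point is only that each block reads files, writes files, and takes the month as a parameter, so any block can be rerun alone.

```python
import csv
import json
import tempfile
from pathlib import Path


def accumulate(data_dir, month, rows):
    """Block 1: store the raw rows for one month as CSV."""
    with (data_dir / f"raw_{month}.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["x", "y"])
        writer.writeheader()
        writer.writerows(rows)


def engineer_features(data_dir, month):
    """Block 2: read the raw CSV and write a typed feature file."""
    with (data_dir / f"raw_{month}.csv").open(newline="") as f:
        rows = [{"x": float(r["x"]), "y": float(r["y"])} for r in csv.DictReader(f)]
    (data_dir / f"features_{month}.json").write_text(json.dumps(rows))


def learn(data_dir, month):
    """Block 3: fit a toy model (y = slope * x) and save it."""
    rows = json.loads((data_dir / f"features_{month}.json").read_text())
    slope = sum(r["x"] * r["y"] for r in rows) / sum(r["x"] ** 2 for r in rows)
    (data_dir / "model.json").write_text(json.dumps({"slope": slope}))


def infer(data_dir, month):
    """Block 4: load the model and the features, then write predictions."""
    model = json.loads((data_dir / "model.json").read_text())
    rows = json.loads((data_dir / f"features_{month}.json").read_text())
    predictions = [model["slope"] * r["x"] for r in rows]
    (data_dir / f"prediction_{month}.json").write_text(json.dumps(predictions))
    return predictions


data_dir = Path(tempfile.mkdtemp())
accumulate(data_dir, "2021-01", [{"x": 1, "y": 2}, {"x": 2, "y": 4}])
engineer_features(data_dir, "2021-01")
learn(data_dir, "2021-01")
predictions = infer(data_dir, "2021-01")
print(predictions)
```

Because no block calls another block directly, changing the logic inside one of them does not affect the others as long as the file it writes keeps the same layout.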

The following is the code for each block. Basically, it is a reuse of the program used in the PoC, with some additions and modifications for operational automation.

Data accumulation

accumulate.ipynb

https://medium.com/media/a299dc63ffb80d9b7b934178ae46285a/href

Feature engineering

feature_engineering.ipynb

https://medium.com/media/e2d6b57c4657ae346cc03e6704e60cf8/href

Learn model

learn.ipynb

https://medium.com/media/e37e0ba25a9b1fe9532c54d08f58c0e8/href

Inference

inference.ipynb

https://medium.com/media/13cfbdf8a3acec4fc30c4993cb58bbb9/href

Semi-automating operations

Once each process has been split into an individual program, Airflow can execute them in order. By passing the forecast month as a parameter at runtime, we can run the workflow for each month. To run it on a schedule, define the date and time as a cron expression in “schedule_interval”. The Airflow code is shown below.

trial_operation.py

https://medium.com/media/a1b231f1fafa1f8fcead4b8706fca060/href

You can view your defined workflow as a flowchart in Airflow. For example, the above code can be visualized as the following figure. You can see that this diagram has the same structure as the data pipeline we defined earlier. (In the figure, each box is green because the blocks have already been completed successfully.)
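The ordering that a workflow engine derives from declared dependencies can be illustrated with the Python standard library. This is not Airflow code and does not use Airflow's API. It is a small sketch, with block names assumed from the data pipeline described earlier, of how an execution order follows from "this block needs that block to finish first".

```python
from graphlib import TopologicalSorter

# Each block maps to the set of blocks that must finish before it starts
dependencies = {
    "accumulate": set(),
    "feature_engineering": {"accumulate"},
    "learn": {"feature_engineering"},
    "inference": {"learn"},
}

# static_order() yields every block after all of its prerequisites
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

A real workflow engine adds scheduling, retries, and notifications on top of this ordering, which is why it is worth using one instead of a hand-written loop.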

With the implementation of test operation phase 1, we were able to automate the monthly operations as shown below. Most of the steps are now automated.

PoC Phase

Upload data for the forecast month
Combine training data of previous months
Preprocessing and feature engineering of training data
Train model from training data
Preprocess test data and do feature engineering
Predict the test data using the trained model
Download the prediction result

Test Operation Phase 1

Upload the data for the forecast month
Run the Workflow from Airflow
Download the prediction result

Test Operation Phase 2: Implementation of regular operation API

Purpose of this phase

In phase 1, we greatly automated monthly operations by dividing functions such as preprocessing and inference into separate programs and executing them in order with papermill and Airflow. In phase 2, we automate the process further. Specifically, we prepare APIs and GUI screens for data upload/download and regular operations, which were done manually in phase 1. Even non-engineering users such as consultants and business departments will then be able to operate the system easily. As a result, regular operations can be left to the users, and the engineers can concentrate on development tasks.

Architecture in this phase

In phase 2, we will build a web server and create a GUI screen to operate it.

Creating a web server

Prepare the following APIs for the web server.

Upload function for input files
Execution of regular operations
Download function of forecast files

This time, we will use FastAPI to create the web server.

server.py

https://medium.com/media/d004817103635170bfb2dd5a70e60c42/href

Creating the GUI screen

For the GUI, we need a button to call the web server API and a form to upload data. In this case, I built the GUI myself with React and TypeScript, but it may be faster to use a library for building GUIs, such as Streamlit.

App.tsx

https://medium.com/media/fb399aeb798b3f4f65ac182f5d0b813e/href

The GUI screen looks like the following image.

Test Operation Phase 3: Migration to the cloud and automation of operations

Purpose of this phase

In phase 3, we move the servers to the cloud and move some functions to managed services to further automate regular operations. The purpose of using the cloud is to increase the availability of the system by delegating work such as infrastructure operation to the cloud provider, so that we can focus on enhancing and maintaining the application. The basic functions are common to clouds such as AWS, GCP, and Azure, but each has different features and characteristics, so it is worth comparing them.

In this article, I briefly examine migration to AWS as an example. There are two migration patterns: Pattern 1, in which the system built in test operation phase 2 is simply migrated to AWS, and Pattern 2, in which further automation is performed.

Architecture Pattern 1 with AWS: Simple EC2-only configuration

Each server is built on EC2, and data is stored in EBS. Usage is almost the same as on local Linux machines, so migration should not be difficult. However, uploading input data and downloading prediction results still have to be done manually. Also, since each function of the system simply runs on EC2, the ease of enhancement and maintenance has not changed much.

Architecture Pattern 2 on AWS: Further automated configuration

Pattern 2 improves on the following issues of Pattern 1.

Automation of data input/output
Splitting some functions into individual programs and managed services

To automate data input/output, we use S3 as a shared folder for exchanging data with external systems. We can monitor data input/output to S3 using CloudWatch and CloudTrail, and call Airflow’s regular operation API from Lambda. The prediction system then runs automatically, triggered by the storage of input files. With this setup, there is no need for a GUI or a web server. A web server in the cloud would require authentication and vulnerability countermeasures, so removing it also reduces these risks.
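The trigger logic can be sketched as a Lambda-style handler. This sketch makes several assumptions: the event has the layout of a direct S3 event notification (events routed through CloudTrail are laid out differently), input files follow a hypothetical key convention of input/<YYYY-MM>/<file name>, and the call to Airflow is passed in as a function so that no particular Airflow endpoint is implied.

```python
from urllib.parse import unquote_plus


def extract_uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def make_handler(trigger_workflow):
    """Build a Lambda-style handler around a caller-supplied trigger function."""

    def handler(event, context):
        triggered = []
        for bucket, key in extract_uploaded_objects(event):
            # Hypothetical key convention: input/<YYYY-MM>/<file name>
            parts = key.split("/")
            if len(parts) == 3 and parts[0] == "input":
                trigger_workflow({"bucket": bucket, "key": key, "month": parts[1]})
                triggered.append(parts[1])
        return {"triggered_months": triggered}

    return handler


# Local dry run: collect the trigger payloads instead of calling Airflow
calls = []
handler = make_handler(calls.append)
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "forecast-io"},
                "object": {"key": "input/2021-01/data+file.csv"},
            }
        }
    ]
}
result = handler(sample_event, None)
print(result, calls)
```

In a deployed function, trigger_workflow would be replaced by code that sends the payload to the workflow engine, and objects outside the input prefix (such as the prediction results written back to S3) are ignored so that the system does not trigger itself.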

To split some of the functions into individual programs and services, we made the following changes.

Changed the storage location of input/output files to S3
Moved the trigger program for system execution to Lambda
Migrated the success/failure notification program to Lambda and SNS

The scope we were able to split out this time is not very wide, but I think the program can be divided further by using other AWS services, making it easier to enhance and maintain. However, if you expand the scope too much, you may end up with vendor lock-in, so you need to consider the ease of migration as well.

We have now completed all considerations up to test operation phase 3. In practice, there are many technical and business hurdles in turning an on-demand PoC analysis into a regularly running production service, but I hope the methods discussed here will be helpful.

Additional things to consider for production operations

Finally, I will list things to consider for production operations in this chapter.

Utilizing the cloud

In test operation phase 3, we migrated to the cloud. Since the cloud has a variety of functions, it is best to utilize them to the extent that they do not significantly sacrifice portability. For example, data governance can be introduced by linking the internal authentication with the cloud authentication function, and the auto-scaling function can be used to handle larger-scale data.

It is also important to eliminate as much of your own code as possible and move to managed services. For the long-term operation of the system, consider using a service that offers functions similar to your program, since your own code is not easy to maintain and tends to depend on the individuals who wrote it. For example, for Airflow, there are managed services such as GCP’s “Cloud Composer” and AWS’s “Amazon Managed Workflows for Apache Airflow”, so using these services is something to consider.

Program reusability

While Jupyter Notebook is easy and convenient for development, notebooks are not easy to manage with git, run, or test. It may be a good idea to migrate to Python files as needed, depending on the balance between development speed and quality. Also, if this system itself can be built on Docker and Kubernetes, it will not only increase the robustness of the system and make it easier to scale the process, but it will also have great business benefits such as making it easier to expand to other projects.

Data storing

In this article, data was stored in CSV or Pickle format, but it is worth considering which data should be stored in which format. For this purpose, it is useful to manage the definition of each dataset in a spreadsheet while the data pipeline is being developed. I often use CSV for data that is difficult to recreate (input data) or that requires external collaboration (forecast results), and Pickle for intermediate data. Pickle is convenient, but it is not versatile or robust, so it is better to store data in CSV format and define the data types separately, or to use the “Parquet” format if you are familiar with it.
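One way to "define the data type separately" is to keep a small schema next to the CSV and cast the columns back when reading. The sketch below uses only the standard library. The column names and the schema format (column name mapped to a type name) are assumptions for illustration, not the layout used in this article's repository.

```python
import csv
import io
import json

# Supported type names in the hypothetical schema format
CASTERS = {"int": int, "float": float, "str": str}


def write_csv_with_schema(rows, schema):
    """Return (csv_text, schema_text); the schema maps column name to type name."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(schema))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue(), json.dumps(schema)


def read_csv_with_schema(csv_text, schema_text):
    """Parse the CSV and cast every column back to its declared type."""
    schema = json.loads(schema_text)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: CASTERS[type_name](row[name]) for name, type_name in schema.items()}
        for row in reader
    ]


rows = [{"store_id": 1, "sales": 120.5, "area": "east"}]
schema = {"store_id": "int", "sales": "float", "area": "str"}
csv_text, schema_text = write_csv_with_schema(rows, schema)
restored = read_csv_with_schema(csv_text, schema_text)
print(restored)
```

CSV alone would return every value as a string, so the schema is what lets the next block in the pipeline receive the same types that the previous block wrote.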

Data monitoring

To continuously operate a machine learning system, you need to pay attention to the data as well as the system. For example, if the trend of the input data changes, it may have a significant impact on the prediction accuracy even if there is no problem with the system. Therefore, it is necessary to monitor the input data, for example, to check if the distribution of data in each column and the relationship with the labels have changed. Also, depending on the system you are creating, you need to verify the fairness of the predictions, for example, whether the prediction results vary depending on gender.
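As a rough illustration of input monitoring, the following sketch flags columns whose mean in the new month has moved far from the mean of a baseline period, measured in baseline standard deviations. The threshold of 3.0 and the sample columns are assumptions, and a real system would likely also compare full distributions and the relationship with the labels.

```python
from statistics import mean, stdev


def find_drifted_columns(baseline, current, threshold=3.0):
    """Flag columns whose current mean is far from the baseline mean.

    The distance is measured in baseline standard deviations.
    """
    drifted = []
    for column, baseline_values in baseline.items():
        spread = stdev(baseline_values)
        shift = abs(mean(current[column]) - mean(baseline_values))
        # Constant baseline columns (spread == 0) are skipped in this sketch
        if spread > 0 and shift / spread > threshold:
            drifted.append(column)
    return drifted


baseline = {"price": [100, 102, 98, 101, 99], "visits": [10, 12, 11, 9, 13]}
current = {"price": [150, 149, 151, 152, 148], "visits": [11, 10, 12, 11, 11]}
drifted = find_drifted_columns(baseline, current)
print(drifted)
```

A check like this can run as an extra block before inference, so that a shifted input is noticed before its predictions are delivered.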

Data governance

At the PoC level, access privileges to data may be naturally limited, but as the operation becomes longer and the number of people involved in the system increases, it will become necessary to set appropriate access privileges for each data. In such cases, it is best to utilize the authentication functions of cloud services. For example, by creating individual accounts with AWS IAM, you can flexibly set access privileges to the data stored in S3 according to each individual’s department or position. Also, since cloud services have functions that can be integrated with internal authentication infrastructure, it is a good idea to use these services.

Software and code used in this article

The source of the system built as an example in this article is stored in the following GitHub repository.

https://github.com/koyaaarr/between_poc_and_production

The versions of the main software used are as follows.

Reference

Beyond Interactive: Notebook Innovation at Netflix (https://netflixtechblog.com/notebook-innovation-591ee3221233)

Between Machine Learning PoC and Production was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.
