<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ryo Koyajima / 小矢島 諒 on Medium]]></title>
        <description><![CDATA[Stories by Ryo Koyajima / 小矢島 諒 on Medium]]></description>
        <link>https://medium.com/@koyaaarr?source=rss-69519ff4d58c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*JNVqmOe8z4aDTtcVHMAPXg.jpeg</url>
            <title>Stories by Ryo Koyajima / 小矢島 諒 on Medium</title>
            <link>https://medium.com/@koyaaarr?source=rss-69519ff4d58c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 01 Jul 2026 07:42:33 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@koyaaarr/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Stone Soup and Data Science]]></title>
            <link>https://koyaaarr.medium.com/stone-soup-and-data-science-52b47731cbfb?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/52b47731cbfb</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Sun, 20 Nov 2022 14:15:29 GMT</pubDate>
            <atom:updated>2022-11-20T14:15:29.346Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2YP6rTZDftVDMcz3f-cN1A.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@foodography?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Gianluca Gerardi</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>I&#39;ve worked as a Data Scientist for several years, often in cooperation with non-technical people. Those experiences remind me that &quot;Stone Soup,&quot; a folk story, offers a profound insight into working as a Data Scientist. Let me share this story and my thoughts on it.</p><h4>For the sake of a delicious meal</h4><p>For those who don&#39;t know it, here is the story as told on <a href="https://en.wikipedia.org/wiki/Stone_Soup">Wikipedia</a>.</p><blockquote>Some travelers come to a village, carrying nothing more than an empty cooking pot. Upon their arrival, the villagers are unwilling to share any of their food stores with the very hungry travelers. Then the travelers go to a stream and fill the pot with water, drop a large stone in it, and place it over a fire. One of the villagers becomes curious and asks what they are doing. The travelers answer that they are making “stone soup”, which tastes wonderful and which they would be delighted to share with the villager, although it still needs a little bit of garnish, which they are missing, to improve the flavor.</blockquote><blockquote>The villager, who anticipates enjoying a share of the soup, does not mind parting with a few carrots, so these are added to the soup. Another villager walks by, inquiring about the pot, and the travelers again mention their stone soup which has not yet reached its full potential. 
More and more villagers walk by, each adding another ingredient, like potatoes, onions, cabbages, peas, celery, tomatoes, sweetcorn, meat (like chicken, pork and beef), milk, butter, salt and pepper. Finally, the stone (being inedible) is removed from the pot, and a delicious and nourishing pot of soup is enjoyed by travelers and villagers alike. Although the travelers have thus tricked the villagers into sharing their food with them, they have successfully transformed it into a tasty meal which they share with the donors.</blockquote><p>This story has several variations, yet they share some common insights.</p><ul><li>Villagers have enough resources to make delicious food, but no one is aware of the possibility.</li><li>Travelers have the ability to make soup, but that ability is useless without enough resources.</li><li>Delicious soup can be made only if villagers and travelers work together.</li></ul><h4>Build a fellowship</h4><p>During a data science project, we often face various hurdles to achieving our goal. Some are technical matters and some are business matters. On the business side especially, we often work with non-technical people and may need to convince them. This can become laborious and time-consuming if they are unwilling to work with us proactively.</p><p>At such times, we can remember what the travelers did to make a great soup. They showed the villagers what was possible by demonstrating with a stone and water. They convinced the villagers one by one and led them toward the big picture.</p><p>In the same way, we data scientists can build a fellowship with non-technical people to accomplish our goals together. We ask about their most painful issue to clarify our goal. We create a prototype so they can imagine how beneficial it would be. Then we can convince them to extract data or provide domain-specific information, like the soup&#39;s ingredients, to make the product more meaningful. 
The more elements we can add to our product, the more people we can get moving in the same direction, and the faster we can move the project forward.</p><h4>Conclusion</h4><p>It has long been said that machine learning needs to be put into practice in society. Whether that happens depends on how well we can translate AI technology into business value for non-technical people and how valuable a big picture we can draw for them.</p><p>As an aside, I came across this folk tale in the book &quot;The Pragmatic Programmer.&quot; The book is written for software engineers, but it&#39;s full of insight, so I recommend it to data scientists too.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stable Diffusion Quickstart with WSL2 and RTX3070]]></title>
            <link>https://koyaaarr.medium.com/stable-diffusion-quickstart-withwsl2-and-rtx3070-e8f4e75a47a4?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/e8f4e75a47a4</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Mon, 19 Sep 2022 17:00:25 GMT</pubDate>
            <atom:updated>2022-11-06T10:45:21.163Z</atom:updated>
            <content:encoded><![CDATA[<h3>Stable Diffusion Quickstart with WSL2 and RTX3070</h3><p>~ Generate a Boss Baby-ish Profile Image ~</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0ShMpjYt15uvsCw_KWAPbw.png" /></figure><h3>Objective</h3><p>To generate my profile image for Twitter.</p><p><em>Unlike LinkedIn or Facebook, Twitter is a somewhat anonymous service, so I don’t want to use my photo as a profile image. Therefore, I looked for a nice picture to use as my SNS icon.</em></p><h3>Pre-requisites</h3><ul><li>Windows 10 Home 21H2</li><li>WSL2 Ubuntu 20.04.5 LTS</li></ul><pre># run on wsl to show the version<br>lsb_release -a</pre><ul><li>Kernel: Linux version 5.10.102.1-microsoft-standard-WSL2</li></ul><pre># run on cmd to show the version<br>wsl cat /proc/version</pre><ul><li>CPU: Ryzen 5 5600X</li><li>GPU: GeForce RTX 3070</li><li>RAM: 32GB</li><li>VRAM: 8GB</li><li>Package management: Anaconda</li><li>Not using Docker</li><li>Uses the optimized Stable Diffusion fork because of the VRAM limitation</li></ul><h3>Quick Guide</h3><ol><li>Install WSL2 and update to the latest version</li></ol><p><a href="https://learn.microsoft.com/ja-jp/windows/wsl/install">https://learn.microsoft.com/ja-jp/windows/wsl/install</a></p><pre># run on cmd<br>wsl --install<br>wsl --update<br>wsl --install -d Ubuntu-20.04</pre><p>2. 
Install CUDA Toolkit</p><p><a href="https://learn.microsoft.com/ja-jp/windows/ai/directml/gpu-cuda-in-wsl">https://learn.microsoft.com/ja-jp/windows/ai/directml/gpu-cuda-in-wsl</a></p><p><a href="https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl">https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl</a></p><pre># run on wsl<br>wget <a href="https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin">https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin</a><br>sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600<br>wget <a href="https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb">https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb</a><br>sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb<br>sudo apt-get update<br>sudo apt-get -y install cuda</pre><p>3. Clone Stable Diffusion (the optimized fork)</p><p><a href="https://github.com/basujindal/stable-diffusion">https://github.com/basujindal/stable-diffusion</a></p><pre># run on wsl<br>git clone git@github.com:basujindal/stable-diffusion.git<br>cd stable-diffusion</pre><p>4. Download the model “sd-v1-4.ckpt”</p><p><a href="https://huggingface.co/CompVis/stable-diffusion-v-1-4-original">https://huggingface.co/CompVis/stable-diffusion-v-1-4-original</a></p><p>5. Rename and move the model</p><pre># run on wsl<br>mkdir -p models/ldm/stable-diffusion-v1<br>mv sd-v1-4.ckpt models/ldm/stable-diffusion-v1/model.ckpt</pre><p>6. 
Install Anaconda (Miniconda)</p><pre># run on wsl<br>wget <a href="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh">https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh</a><br>bash Miniconda3-py38_4.12.0-Linux-x86_64.sh</pre><p>7. Install Python packages</p><pre># run on wsl<br>conda env create -f environment.yaml<br>conda activate ldm</pre><p>8. Prepare an image</p><pre># run on wsl<br>mkdir img<br>mv [path to your image file] img/001.jpg<br># the image file name and path are up to you; any will do.</pre><p>9. Run the script</p><pre># run on wsl<br>python optimizedSD/optimized_img2img.py --prompt &quot;boss baby&quot; --init-img img/001.jpg --strength 0.8 --n_iter 10 --n_samples 10 --H 512 --W 512</pre><p><strong>Parameters</strong></p><ul><li>prompt: the text you want to combine with the image</li><li>init-img: the image you want to use as the starting point</li><li>n_samples: number of images generated per iteration</li></ul><p>Lastly, check the generated images in the stable-diffusion/outputs/img2img-samples/boss_baby directory.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SyYnrxZNL8UZiRnADIxT-Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*4WvQuj_9Zn_q1Etdcw2d0w.png" /><figcaption>My new SNS icon</figcaption></figure><p>I hope this helps!</p><p>*2022/11/6: modified some commands following a comment.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Deploy Your Jupyter Notebook As a Dashboard: A use case of visualizing stock data with AWS]]></title>
            <link>https://koyaaarr.medium.com/how-to-deploy-your-jupyter-notebook-as-a-dashboard-a-use-case-of-visualizing-stock-data-with-aws-ebe5791a5fe7?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/ebe5791a5fe7</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Sun, 28 Aug 2022 12:39:16 GMT</pubDate>
            <atom:updated>2022-08-28T12:43:43.326Z</atom:updated>
            <content:encoded><![CDATA[<p>~ Build an automated dashboarding system with Amazon SageMaker, GitHub Actions, ECR, App Runner, and Mercury ~</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GtZojO2c3urWeQ_zxzgHQQ.png" /><figcaption>Transform Jupyter Notebook into Dashboard</figcaption></figure><h3>Introduction</h3><p>Jupyter Notebook is one of the vital tools for data scientists. However, it is still difficult to collaborate with your teammates and share your notebooks with your stakeholders.</p><p>I&#39;ll introduce an automated dashboarding system using Amazon SageMaker and Mercury.</p><p><strong>SageMaker </strong>enables us to edit notebooks in a hosted environment in AWS. It&#39;s easy to share your notebooks with your teammates, and you don&#39;t have to worry about building and maintaining a Jupyter server yourself.</p><p><strong>Mercury</strong> is a Python library that can transform Jupyter Notebooks into dashboards. You can easily publish your dashboard to your stakeholders by using this library.</p><h3>Architecture</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*noWf5aPNa2HvLUlmIGAGeQ.png" /><figcaption>Automated Dashboarding System Architecture</figcaption></figure><p>For quick understanding, here is the workflow of the whole architecture.</p><ol><li>Developers edit Jupyter Notebooks in Amazon SageMaker.</li><li>Developers push their commits to GitHub.</li><li>GitHub detects the push and automatically runs GitHub Actions, a CI/CD service.</li><li>During GitHub Actions, a Docker image is built and pushed to Amazon ECR, a container registry in AWS.</li><li>As soon as the Docker image is pushed to ECR, it is deployed to AWS App Runner.</li><li>App Runner hosts the Mercury server so that end users can access the dashboard.</li></ol><p>Here are the services and software used in this article.</p><ul><li>Mercury</li><li>AWS App Runner</li><li>Amazon SageMaker</li><li>GitHub Actions</li><li>Amazon ECR (Elastic Container 
Registry)</li><li>Amazon S3</li><li>Docker</li></ul><h3>A use case for visualizing stock data</h3><p>Here is a use case of this dashboarding system that visualizes S&amp;P 500 ETF data.</p><p>Source code: <a href="https://github.com/koyaaarr/invest-analytics-aws">https://github.com/koyaaarr/invest-analytics-aws</a></p><h4>Store the stock data in S3</h4><p>First, you need to store the data you want to visualize. There are several options, and I chose S3 in this use case.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iFEcRPP6nzLrQ5iOj5WMIg.png" /><figcaption>Stock data stored in S3</figcaption></figure><h4>Edit a Jupyter Notebook in SageMaker</h4><p>It&#39;s straightforward to use Jupyter Notebook in SageMaker, even if you&#39;re new to it like me.</p><p>You can launch the SageMaker Studio service and create a new notebook as you would on your local machine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i0D5DIL3JekN1geBhjCt-w.png" /><figcaption>Create a new notebook in SageMaker</figcaption></figure><h4>Visualize the moving average and MACD of the S&amp;P 500 ETF</h4><p>This is the analysis part, where I visualize some financial indicators. The analysis is the most critical part of an actual use case, but I will skip a detailed explanation since it is outside the scope of this article.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Oq-ActfneRz1XQn0RX3zQ.png" /><figcaption>Visualize the MACD of the ETF</figcaption></figure><h4>Add widgets for Mercury</h4><p>To add widgets to the dashboard, you must add a special cell at the top of the notebook. 
You can check the official documentation for details.</p><p><a href="https://github.com/mljar/mercury">https://github.com/mljar/mercury</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uqWj7cpPC1Ki0hRxhzJ8Iw.png" /><figcaption>Config of the widgets</figcaption></figure><p>I laid out some slider widgets in this dashboard, and they are rendered from the configuration like this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/455/1*0Squ05xJvW0jbsJ8Ziiwbg.png" /><figcaption>Actual widgets</figcaption></figure><h4>Commit your work</h4><p>You can easily commit your work with Git in this tab.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/440/1*2TZoPE2i76lllwSgLTNkTA.png" /></figure><h4>Push your commit to GitHub</h4><p>Once you finish your work, you can push it to GitHub. You must generate a personal access token in GitHub to push the commits to the remote repository.</p><p><a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token">https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/765/1*2_icSqKumwH2Jn80rVYdJQ.png" /></figure><h4>Create a Dockerfile and requirements.txt</h4><p>In lines 5 to 12 of the Dockerfile, I install the &quot;TA-LIB&quot; library for stock analysis, so you can skip this part if you don&#39;t need it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/920/1*Y8ULtpkdvpidmvCfAnEGPg.png" /></figure><h4>Create an ECR private repository</h4><p>Before creating the GitHub Actions workflow, you need to create an ECR repository. The name of your repository will be used in GitHub Actions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HXUUySkJ2xPm5Y7UCN3Qxg.png" /></figure><h4>Create a YAML file for GitHub Actions</h4><p>This workflow is configured to run when the master branch changes. 
You can check the official documentation for details. Don&#39;t forget to set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ECR_REPO_NAME as environment variables in your repository. You might need to create an IAM user with appropriate privileges to generate an access key.</p><p><a href="https://github.com/aws-actions/amazon-ecr-login">https://github.com/aws-actions/amazon-ecr-login</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/893/1*PcF0KTiIouCJjw3xW2qzzQ.png" /></figure><h4>Create an App Runner service</h4><p>Lastly, let&#39;s create an App Runner service to host your application. Deployment starts once you finish configuring the service, and it takes several minutes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l55uaa4a5vcNuifp39lhSg.png" /></figure><h4>Access the dashboard</h4><p>After all of this work is finished, you can access your dashboard by clicking the link shown in your App Runner service.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lVp6XXE35fF82wHW43P-bw.png" /></figure><p>You can configure a VPC and security groups for access control.</p><h3>Conclusion</h3><p>I introduced an automated dashboarding system using SageMaker and Mercury. In addition, GitHub Actions provides CI/CD, so your code changes are automatically reflected on the dashboard. I hope this article helps.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Streamlit in Practice; A Use-Case of Visualizing Stock Data]]></title>
            <link>https://koyaaarr.medium.com/python-streamlit-in-practice-a-use-case-of-visualizing-stock-data-20ec1e1c8478?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/20ec1e1c8478</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[stocks]]></category>
            <category><![CDATA[streamlit]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Mon, 27 Jun 2022 17:11:36 GMT</pubDate>
            <atom:updated>2022-06-28T04:10:45.484Z</atom:updated>
            <content:encoded><![CDATA[<p>~ How to create an ETF portfolio simulator in Python Streamlit ~</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tpZuypCypENZw1eq8ijWCA.png" /><figcaption><a href="https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/">https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/</a></figcaption></figure><p>Streamlit is becoming one of the best options for creating demos using only Python. I will explain how to create an effective dashboard using Streamlit and deploy it to Streamlit Cloud, using actual stock data.</p><h3>Quick Demo</h3><p>Here is my demo of the ETF simulator deployed on Streamlit Cloud.</p><p><a href="https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/">https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/</a></p><p>The source code is here.</p><p><a href="https://github.com/koyaaarr/invest-analytics-ui">https://github.com/koyaaarr/invest-analytics-ui</a></p><p>Let me explain how to build the dashboard step by step.</p><h3>Development Environment</h3><p>Before that, I want to mention the development environment. Preparing an organized environment is essential for developing faster and more reliably.</p><p>I recommend using <strong>Visual Studio Code </strong>for coding and <strong>Poetry</strong> for managing Python and its libraries. 
I wrote an article about building the environment, so please take a look if you are interested.</p><p><a href="https://medium.com/codex/python-development-setup-for-data-scientists-2022-7f80b2018402">https://medium.com/codex/python-development-setup-for-data-scientists-2022-7f80b2018402</a></p><h3>Data Processing</h3><p>First, we must process the raw stock data into an appropriate format to visualize it.</p><p>For instance, you need to calculate the portfolio value by multiplying each ETF’s price by the quantity you hold and summing the results.</p><p>I will visualize the following information in the dashboard, so each graph needs data in the correct format.</p><ul><li>Time-series change of overall portfolio value</li><li>Time-series change of the Sharpe ratio</li><li>Each stock’s ratio in my portfolio</li></ul><p>I won’t explain the data processing part in detail. To see it, check the Jupyter Notebook in my source code.</p><p><a href="https://github.com/koyaaarr/invest-analytics-ui/blob/master/notebooks/quick_look.ipynb">https://github.com/koyaaarr/invest-analytics-ui/blob/master/notebooks/quick_look.ipynb</a></p><h3>Visualization</h3><p>Once the data is prepared, let’s visualize each item. I use <a href="https://plotly.com/"><strong>Plotly</strong></a>, a graphing library for Python that produces good-looking charts.</p><h4><strong>Time-series change of overall portfolio value</strong></h4><p>It’s straightforward to plot a line chart in Plotly. 
You can do that with two lines of code (excluding the import statement).</p><pre># you can run this code in Jupyter Notebook<br># stocks: DataFrame containing daily stock values<br># Date: date such as &#39;2022-06-01&#39;<br># Close_Portfolio: portfolio value calculated for that date</pre><pre>import plotly.express as px<br>fig = px.line(stocks, x=&quot;Date&quot;, y=&quot;Close_Portfolio&quot;)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/792/1*mx-t15vP-kwCg9qw1lQ5Gg.png" /><figcaption>plot portfolio</figcaption></figure><p>There is a blank space in 2018 because the values in that period are null. Thus, drop those rows to plot only valid values. In addition, I don’t need grid lines in the graph, so I omit them.</p><pre>fig = px.line(stocks.dropna(subset=[&#39;Close_Portfolio&#39;]), x=&quot;Date&quot;, y=&quot;Close_Portfolio&quot;)<br>fig.update_xaxes(showgrid=False, zeroline=False)<br>fig.update_yaxes(showgrid=False, zeroline=False)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/792/1*M4NQ_WRc4ji6J-S6tR10TQ.png" /><figcaption>plot portfolio (improved)</figcaption></figure><p>This graph is much better than the former one.</p><h4><strong>Time-series change of the Sharpe ratio</strong></h4><p>You can plot the Sharpe ratio in the same way as the portfolio. As a rule of thumb, a portfolio is considered good if its Sharpe ratio is greater than 1. 
Therefore, add a baseline at 1 to the chart.</p><pre># you can run this code in Jupyter Notebook<br># sharpe: DataFrame containing daily Sharpe ratio values<br># Date: date such as &#39;2022-06-01&#39;<br># sharpe_ratio_annual: Sharpe ratio calculated for that date</pre><pre>fig = px.line(sharpe.dropna(subset=[&#39;sharpe_ratio_annual&#39;]), x=&quot;Date&quot;, y=&quot;sharpe_ratio_annual&quot;)<br><strong>fig.add_hline(1, line_color=&quot;red&quot;)</strong><br>fig.update_xaxes(showgrid=False, zeroline=False)<br>fig.update_yaxes(showgrid=False, zeroline=False)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y4BzsNVlMB53HDwDhGhkBQ.png" /><figcaption>plot Sharpe ratio</figcaption></figure><p>Looks good!</p><h4><strong>Each stock’s ratio in my portfolio</strong></h4><p>A pie chart is a good way to see the ratio of each asset in my portfolio. However, I also want to see the share of each asset type and the balance among them. Diversity of asset types (e.g., stock, bond, commodity) is vital for portfolio management. Therefore, I will use a sunburst chart this time.</p><pre># you can run this code in Jupyter Notebook<br># ratio: DataFrame containing tickers and their ratios<br># type: asset type like &#39;stock&#39; or &#39;bond&#39;<br># ticker: asset name like &#39;VOO&#39; or &#39;BTC-USD&#39;<br># ratio_percent: each ticker&#39;s ratio [percent]</pre><pre>fig = px.sunburst(<br>  ratio,<br>  path=[&quot;type&quot;, &quot;ticker&quot;],<br>  values=&quot;ratio_percent&quot;,<br>  title=&quot;Portfolio Recent Value Ratio&quot;<br>)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ErYXlEMf4Noa27TpbZB8eQ.png" /><figcaption>plot of each asset’s ratio in my portfolio</figcaption></figure><p>Now we can see the ratio of each asset and of each asset type.</p><h3>Organize Dashboard</h3><p>Each graph is ready now, so let’s place them on the dashboard. 
I will use <a href="https://streamlit.io/"><strong>Streamlit</strong></a> to create a dashboard. This dashboard consists of the following functions.</p><ul><li>Load initial data</li><li>Process data</li><li>Visualize graphs</li><li>Arrange components</li></ul><h4>Load initial data</h4><p>Load the data generated in the processing part as initial data.</p><p>@st.cache caches the result of the function so that you don’t have to run it every time. Streamlit re-executes the whole program on every interaction, so we should use this feature to reduce the cost of a re-run. (In recent Streamlit releases, @st.cache is deprecated in favor of st.cache_data.)</p><pre>import pandas as pd<br>import streamlit as st<br><br><strong>@st.cache</strong><br>def read_stock_data_from_local():<br>  stocks = pd.read_pickle(&quot;data/stocks.pkl&quot;)<br>  ratio = pd.read_pickle(&quot;data/ratio.pkl&quot;)<br>  sharpe = pd.read_pickle(&quot;data/sharpe.pkl&quot;)<br>  return sharpe, stocks, ratio</pre><h4>Process data</h4><p>Data is processed when we push the calculate button: the time series of the portfolio value, its Sharpe ratio, and the ratio of each asset are calculated.</p><p>This component’s code is complicated, so I won’t show the whole program. 
Here is the outline of this function.</p><pre># num_holds (dict): number of units held for each asset<br># stocks (DataFrame): daily close value of each asset<br># ratio (DataFrame): ratio of each asset in the portfolio at the most recent date<br># sharpe (DataFrame): daily Sharpe ratio<br># portfolio (dict): details of each asset</pre><pre>def calc_stock(num_holds, stocks, ratio, portfolio):<br>  # calc portfolio value<br>  stocks[&quot;Close_Portfolio&quot;] = stocks.apply(<br>    lambda x: calc_portfolio(x, num_holds), axis=1)<br>  # ~~ (omitted)</pre><pre>  # calc sharpe ratio<br>  sharpe = stocks.loc[:, [&quot;Date&quot;, &quot;Close_Portfolio&quot;]]<br>  # ~~~ (omitted)</pre><pre>  # calc recent value ratio<br>  ratio = pd.DataFrame(data={&quot;ticker&quot;: portfolio[&quot;ticker&quot;].keys(), &quot;ratio_percent&quot;: recent_values})<br>  # ~~~ (omitted)</pre><pre>  return sharpe, stocks, ratio</pre><h4>Visualize graphs</h4><p>Visualize three graphs: the portfolio, the Sharpe ratio, and the ratio of assets.</p><p>Streamlit can render charts generated by Plotly with just the st.plotly_chart function.</p><pre># plot sharpe ratio<br>fig = px.line(sharpe, x=&quot;Date&quot;, y=&quot;sharpe_ratio_annual&quot;)<br>fig.add_hline(1, line_color=&quot;red&quot;)<br>fig.update_xaxes(showgrid=False, zeroline=False)<br>fig.update_yaxes(showgrid=False, zeroline=False)<br><strong>st.plotly_chart(fig, use_container_width=True)</strong></pre><h4>Arrange components</h4><p>Place each component (e.g., title, button, input form, and graphs) using Streamlit.</p><p>Arranging components is intuitive, so you can place each part easily. If you want to see examples, visit <a href="https://streamlit.io/gallery?category=science-technology">https://streamlit.io/gallery?category=science-technology</a>.</p><p>One tip for making the most of Streamlit is to use st.session_state to preserve variables across re-runs.</p><p>For instance, you can save the current window size if you want to see a stock line chart with multiple window sizes. 
Streamlit always re-runs the whole script, so you cannot keep state between runs unless you store it in st.session_state.</p><p>You can write code like this in that case.</p><pre># place four buttons with different window sizes<br># if you push the &quot;Year&quot; button, the window size is saved as 360<br># each button needs its own unique key</pre><pre>if st.button(&quot;3Year&quot;, key=&quot;portfolio_3year&quot;):<br>  st.<strong>session_state.window_size</strong> = 1080<br>if st.button(&quot;Year&quot;, key=&quot;portfolio_year&quot;):<br>  st.<strong>session_state.window_size</strong> = 360<br>if st.button(&quot;Quarter&quot;, key=&quot;portfolio_quarter&quot;):<br>  st.<strong>session_state.window_size</strong> = 90<br>if st.button(&quot;Month&quot;, key=&quot;portfolio_month&quot;):<br>  st.<strong>session_state.window_size</strong> = 30</pre><p>The result looks like this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*8uxZIX8jArNcxeZoEJ6S9w.gif" /><figcaption>Save window size as session state</figcaption></figure><h3>Deploy to Streamlit Cloud</h3><p>Finally, deploy our program to a cloud service to share our app with people. 
I will use <a href="https://streamlit.io/cloud">Streamlit Cloud</a>.</p><p>All you have to do is select your repository and configure some settings after signing up.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lm7GqOvLA0TOAYOxSiajgQ.png" /><figcaption>Configure deploy settings</figcaption></figure><p>Streamlit Cloud will install the libraries automatically and deploy our app immediately (if you use Poetry or another package management tool).</p><p>After a while, our app will be deployed like this.</p><p><a href="https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/">https://koyaaarr-invest-analytics-ui-app-blie7w.streamlitapp.com/</a></p><p>You can choose the privacy of your app with these settings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lWb6RXdRdmu5ukp1HXdaGg.png" /><figcaption>Manage option</figcaption></figure><h3>Conclusion</h3><p>In this article, I explained how to process the stock data, visualize it, organize the dashboard, and deploy it to the cloud. I hope this article helps you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Development Setup for Data Scientists in 2022]]></title>
            <link>https://medium.com/codex/python-development-setup-for-data-scientists-2022-7f80b2018402?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/7f80b2018402</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[visual-studio-code]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Sat, 18 Jun 2022 08:30:32 GMT</pubDate>
            <atom:updated>2022-06-24T14:52:14.904Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*r2-ip_usVartA4zC" /><figcaption>Photo by <a href="https://unsplash.com/@sadswim?utm_source=medium&amp;utm_medium=referral">ian dooley</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Many useful tools and libraries have appeared in recent years. Some of them are not well known among data scientists, even though engineers use them often. So I want to introduce some of these tools to data scientists who are new to Python or software development. In this article, I will show my favorite Python development tools for data science.</p><p><strong>This article is intended for data scientists who want to …</strong></p><ul><li>use both Mac and Windows (WSL)</li><li>deploy code to cloud services like Google Cloud Run</li><li>handle several projects simultaneously</li><li>manage environment settings with Git</li></ul><h4><strong>Table of Contents</strong></h4><ul><li><strong>Visual Studio Code</strong>(vscode); free and useful editor</li><li><strong>Peacock</strong>; color scheme manager <strong>[Recommended]</strong></li><li><strong>Rainbow CSV</strong>; coloring CSV file</li><li><strong>autoDocstring</strong>; document generator</li><li><strong>pyenv</strong>; version manager</li><li><strong>Poetry</strong>; powerful package manager <strong>[Recommended]</strong></li><li><strong>Black</strong>, <strong>Flake8</strong>, <strong>isort</strong>, and<strong> Mypy</strong>; formatter and linter</li></ul><h4><strong>Visual Studio Code(vscode); free and useful editor</strong></h4><p><a href="https://code.visualstudio.com/">https://code.visualstudio.com/</a></p><p>Visual Studio Code(vscode) is one of the most popular editors.<br>It also suits data scientists because we can work with Jupyter Notebooks as well as Python files in vscode. 
You don&#39;t have to code in browsers anymore.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NVPRTLVFvexy7c5Q2h-Giw.png" /><figcaption>Jupyter Notebook in vscode (Image by author)</figcaption></figure><h4><strong>Peacock</strong>; color scheme manager</h4><p><a href="https://marketplace.visualstudio.com/items?itemName=johnpapa.vscode-peacock">Peacock - Visual Studio Marketplace</a></p><p>Peacock is one of my favorite extensions in vscode. <br>You can change the color scheme with Peacock by the following steps.</p><ul><li>&quot;Ctrl(Command) + Shift + P&quot; in vscode</li><li>type &quot;Peacock: Change to a Favorite Color&quot;</li><li>select your favorite one</li></ul><p>Of course, you can set your own color by typing &quot;Peacock: Enter a Color&quot; and inputting the hex code.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SNFPRUZhWIcWrdU9Y1baNA.png" /><figcaption>Select your favorite color (Image by author)</figcaption></figure><blockquote>Advantages for data scientists:<br>When you work on several projects simultaneously, Peacock is quite helpful.<br>This is because <strong>you can tell the projects apart by their appearance, which prevents you from mixing them up</strong>.<br>In addition, you can manage the color scheme with Git, so you can use the same color on different computers.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*speMFSyiK_gxVO8atXFTKA.png" /><figcaption>You can distinguish the project you want to work on (Image by author)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_pxpBEN4KhZzr-2yL5wp3Q.png" /><figcaption>You can control the color by Git (Image by author)</figcaption></figure><h4><strong>Rainbow CSV</strong>; coloring CSV file</h4><p><a href="https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv">https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv</a></p><p>If you are a 
data scientist, you often work with CSV files. Rainbow CSV colorizes each column of your CSVs. Excel is a good tool for viewing CSV files, but it takes a long time to open them. Try this extension if you want to check a CSV at a glance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ssbxyUetHRrWEOLsmGeVSQ.png" /><figcaption>Colorizing dataset (Image by author)</figcaption></figure><h4><strong>autoDocstring</strong>; document generator</h4><p><a href="https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring">https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring</a></p><p>autoDocstring is a docstring generator that helps you write maintainable code. Once you define the arguments and return values of your function, this extension generates the docstring template.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*ciT6LvQN9-0YReojlKhMJA.gif" /><figcaption>Type three double quotes, and the docstring template is generated (Image by author)</figcaption></figure><h4><strong>pyenv</strong>; version manager</h4><p><a href="https://github.com/pyenv/pyenv">https://github.com/pyenv/pyenv</a></p><p>pyenv is a popular version manager for Python. To install it on Mac, you can use the brew install pyenv command. 
If you are a Windows user (WSL), try the following commands.</p><pre>git clone https://github.com/pyenv/pyenv.git ~/.pyenv<br>echo &#39;export PYENV_ROOT=&quot;$HOME/.pyenv&quot;&#39; &gt;&gt; ~/.bashrc<br>echo &#39;command -v pyenv &gt;/dev/null || export PATH=&quot;$PYENV_ROOT/bin:$PATH&quot;&#39; &gt;&gt; ~/.bashrc<br>echo &#39;eval &quot;$(pyenv init -)&quot;&#39; &gt;&gt; ~/.bashrc</pre><p>Then install a specific version (e.g., 3.9.11) of Python.</p><pre>pyenv install 3.9.11</pre><p>I recommend that you pin the version for the working directory with this command.</p><pre>pyenv local 3.9.11</pre><p>The command generates a .python-version file, so you can manage the Python version with Git.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/416/1*vcu_fFeU2lqqaSddDsnhxQ.png" /><figcaption>pyenv generates version file (Image by author)</figcaption></figure><h4><strong>Poetry</strong>; a powerful package manager</h4><p><a href="https://github.com/python-poetry/poetry">GitHub - python-poetry/poetry: Python packaging and dependency management made easy</a></p><p>Poetry is a Python package manager that resolves dependencies between libraries. Compared to pip, Poetry manages libraries more smartly. It separates libraries into two lists: one contains the libraries you want to install, and the other contains every library that the former depends on. 
(This is just like npm in JavaScript.)</p><p>For instance, if you install pandas with Poetry, pandas is declared in the former file (pyproject.toml), and all the packages it depends on are recorded in the latter (poetry.lock).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/652/1*qpmapt3FHMaai6Y2p0d9GQ.png" /><figcaption>Former defines only pandas and Python itself (Image by author)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HVrpteor2W5fphzhVnIluA.png" /><figcaption>Latter describes all the packages that are used by pandas (Image by author)</figcaption></figure><p>These files are automatically updated when you install new packages. <strong>You don&#39;t need to run the pip freeze command anymore.</strong></p><p>Moreover, Poetry can generate a virtual environment so that you can execute Python in an isolated environment. Therefore, <strong>you don&#39;t need to worry about unintended dependencies.</strong></p><p>Here is a quick start to Poetry.</p><pre>$ pip install poetry # install Poetry<br>$ poetry config virtualenvs.in-project true --local # generate venv in working directory<br>$ poetry init # initial settings of Poetry<br>$ poetry add pandas # install package e.g. 
pandas<br>$ poetry shell # launch virtual environment</pre><p>If you&#39;ve installed Poetry, don&#39;t forget to set Poetry&#39;s virtual environment as the default interpreter of your vscode.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*24LGmYukzafz7F3RRv0WeA.png" /><figcaption>select poetry virtual environment (Image by author)</figcaption></figure><p>Once you’ve set up Poetry and put pyproject.toml, poetry.lock, and poetry.toml under Git, you can share the environment you’ve created with your teammates.</p><h4><strong>Black</strong>, <strong>Flake8</strong>, <strong>isort, and Mypy</strong>; formatter and linter</h4><ul><li><a href="https://github.com/psf/black">GitHub - psf/black: The uncompromising Python code formatter</a></li><li><a href="https://github.com/PyCQA/flake8">GitHub - PyCQA/flake8: flake8 is a python tool that glues together pycodestyle, pyflakes, mccabe, and third-party plugins to check the style and quality of some python code.</a></li><li><a href="https://github.com/PyCQA/isort">GitHub - PyCQA/isort: A Python utility / library to sort imports.</a></li><li><a href="https://github.com/python/mypy">GitHub - python/mypy: Optional static typing for Python</a></li></ul><p>These packages speed up your coding and keep your programs neat.</p><p>They are only used in a development environment, so you can install them with the -D option.</p><pre>poetry add -D black flake8 isort mypy</pre><p>Then modify vscode settings via settings.json. You can enable the above linters and formatters explicitly.</p><pre>&quot;python.formatting.provider&quot;: &quot;black&quot;,<br>&quot;python.linting.flake8Enabled&quot;: true,<br>&quot;[python]&quot;: {<br>  &quot;editor.codeActionsOnSave&quot;: {<br>    &quot;source.organizeImports&quot;: true<br>  }<br>},<br>&quot;python.linting.mypyEnabled&quot;: true,</pre><h4>Conclusion</h4><p>I&#39;ve introduced several valuable tools for data scientists to set up a Python environment. 
I uploaded the source files to this repository (<a href="https://github.com/koyaaarr/python-setup">https://github.com/koyaaarr/python-setup</a>).</p><p>I hope this article is helpful to you.</p><hr><p><a href="https://medium.com/codex/python-development-setup-for-data-scientists-2022-7f80b2018402">Python Development Setup for Data Scientists in 2022</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[dbt and BigQuery in Practice; A Use-Case of Transforming Stock Data]]></title>
            <link>https://blog.devgenius.io/dbt-and-bigquery-in-practice-transform-stock-data-1771e2393319?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/1771e2393319</guid>
            <category><![CDATA[bigquery]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Thu, 02 Jun 2022 10:44:46 GMT</pubDate>
            <atom:updated>2022-12-27T13:59:43.661Z</atom:updated>
            <content:encoded><![CDATA[<p>Updated 2022/7/22: updated the data pipeline as follows:</p><ul><li>create an additional warehouse to store calculated portfolio value<br>-&gt; to isolate the data marts so that a change in one mart does not affect the others</li></ul><p>Updated 2022/6/11: pushed source code to GitHub: <a href="https://github.com/koyaaarr/invest-analytics-model">https://github.com/koyaaarr/invest-analytics-model</a></p><h3>Introduction</h3><p>This article explains how to use dbt and BigQuery to transform actual data.</p><p>You can easily create a data lake, data warehouse, and data mart using dbt. It also enables us to test our data quality. I will combine BigQuery with dbt to transform actual stock data into a data mart used by my dashboard.</p><p>This article relates to the following one, so please read it if you have some time.</p><p><a href="https://koyaaarr.medium.com/a-practical-use-case-of-cloud-native-and-secured-dashboard-with-google-cloud-and-python-streamlit-a66e60d62ca8">https://koyaaarr.medium.com/a-practical-use-case-of-cloud-native-and-secured-dashboard-with-google-cloud-and-python-streamlit-a66e60d62ca8</a></p><p>Then, let&#39;s get started.</p><h3><strong>Modeling</strong></h3><p>Before getting into the transformation, we need to define the data schema of each table.</p><p>Here is an overview of the tables we need.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Gy4RE4DY28Zqla8KHB3suw.png" /><figcaption>data pipeline</figcaption></figure><p>On the data mart side, I want to see the overall performance of my portfolio and the ratio of each stock within it. Therefore, I need to create two data marts for these purposes.</p><p>On the source side, the data for each ticker (VOO, BTC-USD, BND) is stored in Google Cloud Storage. 
The files are in CSV format and contain dates and values such as the closing price.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*Dpci8FRdFcN9WDgFY_JW3g.png" /><figcaption>Source Stock Data</figcaption></figure><p>Therefore, I need to aggregate those data sources into the data warehouse and transform them into each data mart.</p><p>Each data schema is described in the following section.</p><h3><strong>Introducing dbt and BigQuery</strong></h3><p>Here are the prerequisites of this use case. I will use the dbt CLI, installed as Python packages.</p><pre>Python: 3.9.11<br>dbt-core: 1.1.0<br>dbt-bigquery: 1.1.0</pre><p>First of all, you can initialize dbt by the following command.</p><pre>dbt init</pre><p>This command creates a lot of files and directories.</p><p>Then you can make &quot;profiles.yml&quot; in the same directory as &quot;dbt_project.yml&quot;. This file is generated in &quot;~/.dbt&quot; by default, but I recommend you create it in your working directory so that you can manage it with Git.</p><p>In the beginning, you will edit &quot;models/&quot;, &quot;dbt_project.yml&quot;, and &quot;profiles.yml&quot;.</p><p>Let&#39;s take a look at each file.</p><p>&quot;dbt_project.yml&quot; defines the configuration of the project. You will edit the bottom of this file. It lists the tables we create, and you can specify the materialization type of each table.</p><pre>name: &#39;invest_analytics&#39;<br>version: &#39;1.0.0&#39;<br>config-version: 2</pre><pre>~~~</pre><pre>models:<br>  invest_analytics:<br>    invest_analytics_dev:<br>      +materialized: view<br>      warehouse-date:<br>      warehouse-stock:<br>      warehouse-num-hold:<br>      warehouse-portfolio:<br>      mart-portfolio-value:<br>        +materialized: table<br>      mart-portfolio-ratio:<br>        +materialized: table</pre><p>&quot;profiles.yml&quot; defines system configuration, including connection with BigQuery. 
If you authenticate using a service account, you need to designate the key file.</p><pre>invest_analytics:<br>  outputs:<br>    dev:<br>      dataset: invest_analytics_dev<br>      job_execution_timeout_seconds: 300<br>      job_retries: 1<br>      keyfile: ../service_account.json<br>      location: asia-northeast1<br>      method: service-account<br>      priority: interactive<br>      project: invest-analytics-347211<br>      threads: 1<br>      type: bigquery<br>  target: dev</pre><p>The &quot;models&quot; directory contains SQL files and &quot;schema.yaml&quot;.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/552/1*eGzrRWDTAJV-U3uQE8s_Lg.png" /></figure><p>You can write standard SQL in dbt; the only difference is how you refer to the source tables.</p><p>You need to reference other tables with dbt&#39;s format like this instead of the ordinary format.</p><pre>select<br>    Date<br>  , cast(close_voo as integer) as close_voo<br>  , cast(close_btcusd as integer) as close_btcusd<br>  , cast(close_bnd as integer) as close_bnd<br>  , cast(close_total as integer) as close_total<br>from<br>  {{ ref(&#39;warehouse-stock&#39;) }} as st<br>  left outer join {{ ref(&#39;warehouse-portfolio&#39;) }} as pf <br>    on st.Date = pf.Date<br>order by<br>  Date</pre><p>If your table is built from source data such as CSV files, you can refer to the source data like this.</p><pre>select<br>    max(case ticker when &#39;VOO&#39; then num_of_hold else null end) as num_voo<br>  , max(case ticker when &#39;BTC-USD&#39; then num_of_hold else null end) as num_btcusd<br>  , max(case ticker when &#39;BND&#39; then num_of_hold else null end) as num_bnd<br>from<br>  {{ source(&#39;invest_analytics_dev&#39;, &#39;source-portfolio&#39;) }}</pre><p>Finally, you need to define the data schemas in &quot;schema.yaml&quot; like this.</p><pre>version: 2<br>sources:<br>  - name: invest_analytics_dev<br>    tables:<br>      - name: source-voo<br>      - name: source-btcusd<br>      - name: source-bnd<br>      - name: 
source-portfolio</pre><pre>models:<br>  - name: mart-portfolio-ratio<br>    description: &#39;&#39;<br>    columns:<br>      - name: ticker<br>        description: &#39;&#39;<br>        tests:<br>          - unique<br>          - not_null<br>          - accepted_values:<br>              values: [&#39;voo&#39;, &#39;btcusd&#39;, &#39;bnd&#39;]<br>      - name: close_percent<br>        description: &#39;&#39;<br>        tests:<br>          - unique<br>          - not_null</pre><p>If you have data sources imported from CSV files, you can write them in the &quot;sources&quot; part.</p><p>Then you can add the data schemas of your tables. In addition, you can define tests for each column. This example contains the &quot;uniqueness test&quot;, &quot;not null test&quot;, and &quot;accepted values test&quot;.</p><p>Once you&#39;ve finished defining each file, let&#39;s generate the tables with this command.</p><pre>dbt run --full-refresh --profiles-dir .</pre><p>Then you get the result like this.</p><pre>12:11:02  Running with dbt=1.1.0<br>12:11:02  Unable to do partial parsing because a project config has changed<br>12:11:03  Found 5 models, 17 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 4 sources, 0 exposures, 0 metrics<br>12:11:04  Concurrency: 1 threads (target=&#39;dev&#39;)<br>12:11:04  1 of 5 START view model invest_analytics_dev.warehouse-date .................... [RUN]<br>12:11:06  1 of 5 OK created view model invest_analytics_dev.warehouse-date ............... [OK in 1.61s]</pre><pre>~~~</pre><pre>12:11:11  5 of 5 START table model invest_analytics_dev.mart-portfolio-ratio ............. [RUN]<br>12:11:14  5 of 5 OK created table model invest_analytics_dev.mart-portfolio-ratio ........ [CREATE TABLE (3.0 rows, 62.7 KB processed) in 3.27s]<br>12:11:14  Finished running 3 view models, 2 table models in 11.36s.<br>12:11:14  Completed successfully<br>12:11:14  Done. 
PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5</pre><p>You can see that each table has been created in the Google Cloud Console.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o9--WU7ZwoaYsfwLxjIDWw.png" /><figcaption>BigQuery console</figcaption></figure><p>If you want to check the quality of the data, run this command.</p><pre>dbt test --profiles-dir .</pre><p>Then, you get the result like this.</p><pre>10:04:03  Running with dbt=1.1.0<br>10:04:03  Found 5 models, 17 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 4 sources, 0 exposures, 0 metrics<br>10:04:04  Concurrency: 1 threads (target=&#39;dev&#39;)<br>10:04:04  1 of 17 START test accepted_values_mart-portfolio-ratio_ticker__voo__btcusd__bnd  [RUN]<br>10:04:06  1 of 17 PASS accepted_values_mart-portfolio-ratio_ticker__voo__btcusd__bnd ..... [PASS in 2.16s]</pre><pre>~~~</pre><pre>10:04:31  17 of 17 START test unique_warehouse-stock_Date ................................ [RUN]<br>10:04:32  17 of 17 PASS unique_warehouse-stock_Date ...................................... [PASS in 1.33s]<br>10:04:32  Finished running 17 tests in 28.88s.<br>10:04:32  Completed successfully<br>10:04:32  Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17</pre><p>Lastly, let&#39;s generate the documentation for our tables with the following commands.</p><pre>dbt docs generate --profiles-dir .<br>dbt docs serve --profiles-dir .</pre><p>Then you can see the table definitions and lineage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vEYg8lKyyjz4vhxbTyHyCA.png" /><figcaption>table definition</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SjrUzJDJTuP_UHDlPE7k9Q.png" /><figcaption>lineage graph</figcaption></figure><h3><strong>Conclusion</strong></h3><p>I hope you find this article helpful. I explained the actual use case of dbt and BigQuery with stock data. 
You can create tables according to their dependencies, test data quality, and even generate the definition documents.</p><hr><p><a href="https://blog.devgenius.io/dbt-and-bigquery-in-practice-transform-stock-data-1771e2393319">dbt and BigQuery in Practice; A Use-Case of Transforming Stock Data</a> was originally published in <a href="https://blog.devgenius.io">Dev Genius</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Practical Use-Case of Cloud-Native and Secured Dashboard with Google Cloud and Python Streamlit]]></title>
            <link>https://koyaaarr.medium.com/a-practical-use-case-of-cloud-native-and-secured-dashboard-with-google-cloud-and-python-streamlit-a66e60d62ca8?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/a66e60d62ca8</guid>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[google-cloud-run]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Wed, 25 May 2022 12:13:59 GMT</pubDate>
            <atom:updated>2022-06-02T10:46:22.375Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*6G_gsddIBOvKT0iMyVJ8Ow.gif" /><figcaption>Demo</figcaption></figure><h3><strong>Introduction</strong></h3><p>With the rise of cloud services and data scientist-friendly visualization tools, building a dashboard is getting easier and faster.</p><p>However, it’s also becoming more and more complicated to understand or utilize them.</p><p>This article shows a use case that combines these technologies by building a secured dashboard for managing my investment portfolio.</p><p>This article explains the application from three perspectives: business, data science, and engineering. These are often defined as essential skills in data science. Therefore, I break down my explanation into these sections so you can read the ones you’re interested in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/528/1*l_E14BVgAXk82o1YbtCW8Q.png" /><figcaption>Data Science Skill’s Venn Diagram</figcaption></figure><h3><strong>Business Perspective: Requirements</strong></h3><p>Though this article focuses on technology, it wouldn’t be convincing if my app were impractical (even if it is only for personal use).</p><p>Therefore, I will define some requirements before the implementation.</p><p>By the way, I buy some ETFs every month, but I’m not sure what’s going on in my portfolio. This is because the prices of ETFs vary and go up and down day by day. In addition, I don’t check my portfolio frequently because I don’t want to spend much time watching the stock markets. These things led me to create an app satisfying the following requirements.</p><p>1. show specific ETFs I’m interested in to see whether each stock is a bargain or not<br>2. show the current value of my portfolio to check how well or badly it is doing<br>3. 
show the ratio of the types of ETFs (e.g., stock/bond/commodity) to help me decide whether I need to rebalance my portfolio toward the best ratio of asset types<br>4. update daily because I’ll check this daily at most<br>5. require authentication to hide my tangible assets (This is IMPORTANT)!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ga7bihUfRdMKqV2hyDpeqQ.png" /><figcaption>What I want to see</figcaption></figure><p>In addition to the above requirements, the UI should be handy but provide sufficient information. Something between iPhone’s Stocks app and TradingView would be ideal for me.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YOiEiMhfrah_GUclOIiTkw.png" /><figcaption>Target Position</figcaption></figure><h3><strong>Data Science Perspective: Data Modeling and Build Data Pipeline</strong></h3><p>I need to prepare a data mart for my dashboard to meet the above requirements. The data mart is one of the concepts in the data model, which also includes the data warehouse and the data lake. These concepts have different purposes, so let me explain them in the following table.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zX2r1mZJ1DWTHsmaV3MQrA.png" /><figcaption>Data Model</figcaption></figure><p>Two types of visualization are needed, so I will create two data marts and a data warehouse that can provide enough data for the data marts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Gy4RE4DY28Zqla8KHB3suw.png" /><figcaption>Data Pipeline</figcaption></figure><p>Now let’s get into the data schema of the data marts. The first data mart is used to plot a line chart of my portfolio and stocks, so historical values need to be prepared. 
The second one is to plot a pie chart of the ratio of my portfolio, so each stock’s ratio needs to be calculated.</p><p><strong>Calculate Portfolio Value</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/730d364002a80ba278eba29837ab0004/href">https://medium.com/media/730d364002a80ba278eba29837ab0004/href</a></iframe><p><strong>Calculate Portfolio Ratio</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3993d05909ff9416b0f3c342471321a0/href">https://medium.com/media/3993d05909ff9416b0f3c342471321a0/href</a></iframe><p>The data modeling in detail is omitted due to space limitations. In the next article, I will introduce Google BigQuery and dbt in this data pipeline to explain modeling.</p><p>Ref: <a href="https://koyaaarr.medium.com/dbt-and-bigquery-in-practice-transform-stock-data-1771e2393319">https://koyaaarr.medium.com/dbt-and-bigquery-in-practice-transform-stock-data-1771e2393319</a></p><h3><strong>Engineering Perspective: Architecture and Software</strong></h3><p>Finally, select appropriate software and services and combine them to realize my system. 
Here is the whole architecture.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4_bi6NIc0pPMfp_mZ4vd9g.png" /><figcaption>Architecture</figcaption></figure><p>Let me explain each component by role.</p><p><strong>Data Retrieval, Transformation, and Accumulation Script<br></strong>- Cloud Functions: for data retrieving, transforming, and accumulating<br>- Cloud Storage: data will be served from here via API<br>- pandas-datareader: to get stock data<br>- gcsfs: to get data from Cloud Storage</p><p><strong>Data Visualization Application<br></strong>- Cloud Run: run the application containerized with Docker<br>- Cloud IAP (Identity-Aware Proxy): add authentication to the Cloud Run app without coding<br>- Streamlit: serve a dashboard quickly and nicely<br>- Plotly: plot graphs quickly and nicely</p><p><strong>Operation, CI/CD<br></strong>- Cloud Build: connect with GitHub to automatically and immediately deploy to Cloud Run / Cloud Functions after git push<br>- Cloud Scheduler: trigger Cloud Functions regularly<br>- Cloud Pub/Sub: the same purpose as Cloud Scheduler</p><h3><strong>How to use it regularly</strong></h3><p>I use a YAML file to simplify the operation of managing my portfolio. It configures my portfolio and contains the number of stocks I have and the details of each stock.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/32dd9bb9503fdab8ee19d1e6112535b6/href">https://medium.com/media/32dd9bb9503fdab8ee19d1e6112535b6/href</a></iframe><p>All I need to do is modify the number of stocks I hold in this YAML file when I buy some stocks. 
After git push, Cloud Build detects the change and copies the YAML file to Cloud Storage automatically. Cloud Functions then recalculates the data from it, so Cloud Run can fetch the latest data from there.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*noXtRgxw3mc1QD6pTQ1w7Q.png" /><figcaption>Operation</figcaption></figure><p>Lastly, this is what my dashboard looks like.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fO95MwAoT6V4y3bfudTsVg.png" /><figcaption>Authentication is required</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IlT8rduJsT_Rpb7V3jLAEw.png" /><figcaption>Dashboard Overview</figcaption></figure><h3><strong>Conclusion</strong></h3><p>I broke down my application into three perspectives. There are a lot of valuable services like Cloud Run and Cloud IAP. These look complicated to use but are quite helpful in building an application quickly, so I strongly recommend diving into them. This article explained how to create the dashboard using Google Cloud and Python Streamlit. I hope you will find this helpful.</p><h3><strong>Reference</strong></h3><p>- Data Science Venn diagram (<a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram</a>)<br>- Trading View (<a href="https://www.tradingview.com/">https://www.tradingview.com/</a>)<br>- iPhone Stocks app (<a href="https://apps.apple.com/us/app/stocks/id1069512882">https://apps.apple.com/us/app/stocks/id1069512882</a>)<br>- Enabling IAP with Cloud Run (<a href="https://cloud.google.com/iap/docs/enabling-cloud-run">https://cloud.google.com/iap/docs/enabling-cloud-run</a>)</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[I, as a data scientist, will show you why Jupyter Notebook and Jupyter Lab are good for data…]]></title>
            <link>https://koyaaarr.medium.com/i-as-a-data-scientist-will-show-you-why-jupyter-notebook-and-jupyter-lab-are-good-for-data-ee507aff41cb?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/ee507aff41cb</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[jupyter-notebook]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Thu, 20 May 2021 20:40:27 GMT</pubDate>
            <atom:updated>2021-06-10T19:14:22.336Z</atom:updated>
            <content:encoded><![CDATA[<h3>I, as a data scientist, will show you why Jupyter Notebook and Jupyter Lab are good for data analysis</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ENb7LhO8QFqIx8plvHd-mw.png" /><figcaption>This is how data is visualized using Jupyter Lab in the demo in this article</figcaption></figure><h3>For those who want to get started with data analysis in Python</h3><p>This article will introduce Jupyter Notebook and Jupyter Lab (collectively called Jupyter), very reliable tools for data analysis in Python.</p><p>Jupyter is already in common use in the data science world, but I would like to show its benefits with a demo.</p><h4>Assumptions</h4><p>In this article, I analyze data under the following conditions.</p><ul><li>Analyze tabular data, not unstructured data such as images and text</li><li>Analyze data of several GB or tens of thousands of records, rather than data of several TB or hundreds of millions of records</li><li>Do exploratory analysis, rather than routine analysis</li></ul><h4>What is not written</h4><p>The following items are not covered in this article. If you want to use Jupyter after reading this article, please refer to other websites or books.</p><ul><li>How to set up a Jupyter Notebook or Jupyter Lab environment</li><li>Basic operations of Jupyter Notebook and Jupyter Lab</li><li>How to use pandas data structures and methods</li></ul><h3>What a time-consuming process data analysis is!</h3><p>Exploratory data analysis is time-consuming. I think this is because it requires a great deal of trial and error. In conventional development, trial and error often means fixing a bug in the code or modifying an algorithm. However, there is much more trial and error from the data perspective when it comes to data analysis. 
You have to look at the data from various angles, verify the quality of the data, and even modify the code when you realize that the data definition you heard from the business department is different…</p><p>Therefore, in exploratory data analysis, <strong>it is important to be able to do trial and error as quickly as possible</strong>.</p><p>You also need to report the results of the exploratory data analysis to your boss or clients. Because of the nature of reporting analysis results, the report (PowerPoint, etc.) will contain many tables and graphs, and preparing them is an unexpectedly difficult and time-consuming task.</p><p>So, <strong>it is also important to be able to prepare tables and graphs quickly</strong>.</p><h3>Two benefits of Jupyter</h3><p>Time consumption is the bane of exploratory data analysis, but Jupyter can alleviate it. For example, it has the following advantages.</p><ul><li><strong>Faster trial and error iteration</strong><br>- You can get execution results cell by cell (even line by line)<br>- Variables are kept while Jupyter is running so that you can use them multiple times</li><li><strong>Easy to see the execution results</strong><br>- Tabular data is easy to read<br>- Graphs are printed right below the code</li></ul><p>I would like to demonstrate these benefits with a demo.</p><h3>Demo with rental apartment data</h3><p>I will use rental apartment data in Chuo-ku, Tokyo that I got from <a href="https://suumo.jp/chintai/">SUUMO</a> (a Japanese rental apartment website) to demonstrate the advantages of Jupyter (*1). I like Jupyter Lab, so I will use it for this demo.</p><p>The purpose of the data exploration is to visualize the distribution of the rent of rental apartments.</p><p>First, we need to import pandas and load the data. 
If a character encoding error occurs due to Japanese or Windows characters, pass encoding=&#39;CP932&#39; as an argument.</p><pre># Load the library<br>import pandas as pd</pre><pre># Read in the data<br>apart = pd.read_csv(&#39;apartments_20210410_chuo.csv&#39;)</pre><p>Once the data has been read, use the <strong>head()</strong> method to display and check the data. <strong>This head() method is so good that you can see the tabular data very easily and clearly. </strong>In my opinion, it is possible to use this table’s screenshot for reports. (Of course, it depends on who you are reporting to. If you are working with an external client, it is better to export to a CSV file and use a PowerPoint table.)</p><pre># Display the data<br>apart.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FLDpMznJeESUM1TrJacehQ.png" /><figcaption>Image by author</figcaption></figure><p><em>The default output is 5 rows, but you can change the number of rows by passing a number as an argument. In my usage, I use 5 rows (default) when I want to see the columns and values of the data, 1 row when I want to save the data to see later, and 100 rows when I want to see the data itself.</em></p><p>The purpose of this demo is to visualize the rent. The rent is in the form of “10 万円”, which contains kanji, so we need to remove these characters and convert the value into an integer.</p><p>We will combine the lambda expression with the map function to fix the rent column. 
<strong>One of the advantages of Jupyter is that you can iterate like this, thinking and executing processes on the fly.</strong> (This may be a good point about the interactive environment rather than Jupyter…)</p><pre># Erase &#39;円&#39;<br># If there is &#39;万&#39;, remove it and multiply by 10000<br>apart[&#39;rent_yen&#39;] = list(map(lambda x: x.replace(&#39;円&#39;, &#39;&#39;), apart.rent))<br>apart[&#39;rent_yen&#39;] = list(map(lambda x: float(x.replace(&#39;万&#39;, &#39;&#39;))*10000 if &#39;万&#39; in x else x, apart.rent_yen))<br>apart[&#39;rent_yen&#39;] = apart[&#39;rent_yen&#39;].astype(&#39;int&#39;)</pre><p><em>There is a function called </em><strong><em>apply()</em></strong><em> in pandas that can do the same thing as </em><strong><em>map()</em></strong><em>, but I recommend using </em><strong><em>map()</em></strong><em> for its speed. However, </em><strong><em>map()</em></strong><em> can only process one column of the </em><strong><em>DataFrame</em></strong><em> at a time, so if you need to process values from multiple columns in one row at the same time, use </em><strong><em>apply()</em></strong><em>.</em></p><p>By the way, when you rent an apartment in Japan, you usually sign a two-year contract. Condominium fees and a gratuity are also required. To estimate the total cost more accurately, let’s calculate the cost over two years. Specifically, we will calculate the sum of two years of rent (24 months), plus two years of condominium fees (24 months), plus the gratuity.</p><pre># Erase &#39;円&#39;<br># If there is &#39;万&#39;, remove it and multiply by 10000<br>apart[&#39;condo_fee_yen&#39;] = list(map(lambda x: x.replace(&#39;円&#39;, &#39;&#39;), apart.condo_fee))<br>apart[&#39;condo_fee_yen&#39;] = list(map(lambda x: float(x.replace(&#39;万&#39;, &#39;&#39;))*10000 if &#39;万&#39; in x else x, apart.condo_fee_yen))<br>apart[&#39;condo_fee_yen&#39;] = apart[&#39;condo_fee_yen&#39;].astype(&#39;int&#39;)</pre><p>We convert the condominium fee into yen in the same way as the rent.</p><p>But when we apply the same processing, we get an error. 
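</p><p>A minimal sketch of why this fails, assuming the fee column marks "no fee" with a bare hyphen. The sample strings and the helper name to_yen_naive are made up for illustration and are not taken from the SUUMO data:</p>
<pre>
```python
# Hypothetical fee strings; only their formats matter here.
samples = ["5000円", "10万円", "-"]


def to_yen_naive(text):
    # Same idea as the rent conversion: drop 円, expand 万 to 10000.
    text = text.replace("円", "")
    if "万" in text:
        return int(float(text.replace("万", "")) * 10000)
    return int(text)


results = []
for s in samples:
    try:
        results.append(to_yen_naive(s))
    except ValueError:
        # A bare hyphen cannot be parsed as an integer.
        results.append(None)

print(results)
```
</pre>
<p>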
It seems that it could not be converted to a numeric type because there was a hyphen.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uHLRH8N1L8DfmstWazDz8A.png" /><figcaption>Image by author</figcaption></figure><p>Checking the data, we see that a hyphen indicates that there is no condominium fee.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XZoERzja6_6nYv3MCa8BIA.png" /><figcaption>Image by author</figcaption></figure><p>Even if errors occur, Jupyter itself is still running. <strong>Thus the variables and libraries that have been calculated and loaded are still alive, so we can try again. This is another advantage of Jupyter.</strong></p><p>We’ll create a function to handle hyphens, but it’s a bit too complicated to write as a lambda expression, so we’ll write it as a regular function.</p><pre>def extract_jpy(x):<br>  &quot;&quot;&quot;<br>  - Erase &#39;円&#39;<br>  - hyphen is replaced by 0<br>  - If there is &#39;万&#39;, remove it and multiply by 10000<br>  - convert into integer<br>  &quot;&quot;&quot;<br>  x = x.replace(&#39;円&#39;, &#39;&#39;)<br>  x = x.replace(&#39;-&#39;, &#39;0&#39;)<br>  if &#39;万&#39; in x:<br>    x = x.replace(&#39;万&#39;, &#39;&#39;)<br>    x = float(x)*10000<br>  return int(x)</pre><pre>apart[&#39;condo_fee_yen&#39;] = list(map(extract_jpy, apart.condo_fee))</pre><p>It looks like we have successfully converted the condominium fee into a number.</p><p>We can now do the same for the gratuity.</p><pre>apart[&#39;gratuity_yen&#39;] = list(map(extract_jpy, apart.gratuity))</pre><p><em>Since we are doing the same processing here as for the condominium fees, we can copy the cells and reuse them. Jupyter has some useful shortcut keys that can be used for quick operations. In particular, I often use c: copy cell, x: cut cell, v: paste cell, z: undo cell operation, a: add new cell above, b: add new cell below. 
I also recommend using ESC: switch to cell operation mode and Enter: switch to code input mode, as they will accelerate your work.</em></p><p>Let’s check that the three columns we created look right.</p><pre>apart.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tZ2RDaQGKpea2mteq7lQ1w.png" /><figcaption>Image by author</figcaption></figure><p>It looks like “rent_yen”, “condo_fee_yen”, and “gratuity_yen” have all been extracted correctly as numerical values.</p><p>Now, let’s calculate the total cost for two years using the pandas apply function.</p><pre>apart[&#39;cost_2years&#39;] = apart.apply(lambda x: (x.rent_yen + x.condo_fee_yen)*24 + x.gratuity_yen, axis=1)</pre><pre>apart.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hyJ8uZEWQ2hU14Gsu3EeGA.png" /><figcaption>Image by author</figcaption></figure><p>It looks like we have successfully calculated the total cost over two years. We are now ready to visualize the data.</p><p>For the visualization, we will use <a href="https://plotly.com/python/">plotly</a>. This is my favorite library because of its ease of use and beautiful visualization. <strong>In particular, the charts look good enough to be used in PowerPoint as is.</strong> (Unlike seaborn/matplotlib, Japanese text is not garbled by default, which is also nice.)</p><pre># Load the library<br>import plotly.express as px</pre><pre># Visualize<br>fig = px.histogram(apart, x=&#39;cost_2years&#39;)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P1qByG7-rDA1a9A3IUDxDA.png" /><figcaption>Image by author</figcaption></figure><p>The histogram shows a wide distribution, from 200k to 23M JPY. 
20M JPY apartments are too expensive for me to live in, so we’ll keep only the listings up to 10M JPY, which covers most of the data.</p><pre>fig = px.histogram(apart.query(&#39;cost_2years &lt;= 10000000&#39;), x=&#39;cost_2years&#39;)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bwNCfqHrvbqY6wPYcvZjUg.png" /><figcaption>Image by author</figcaption></figure><p>We can see that there are several peaks in this graph. The distribution might be different depending on the room layout, so let’s try to visualize it by color-coding according to the layout.</p><pre>fig = px.histogram(apart.query(&#39;cost_2years &lt;= 10000000&#39;), x=&#39;cost_2years&#39;, color=&#39;layout&#39;, barmode=&#39;overlay&#39;)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ENb7LhO8QFqIx8plvHd-mw.png" /><figcaption>Image by author</figcaption></figure><p>We can see that the distribution differs depending on the room layout. Now let’s compare the distribution of 1K and 1LDK. Since most of the data is up to 7M JPY, we will filter by 7M.</p><pre>fig = px.histogram(apart.query(&#39;cost_2years &lt;= 7000000 and (layout == &quot;1K&quot; or layout == &quot;1LDK&quot;)&#39;), x=&#39;cost_2years&#39;, color=&#39;layout&#39;, barmode=&#39;overlay&#39;)<br>fig.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JcZcVX2AmbtOPGNXjj9dVg.png" /><figcaption>Image by author</figcaption></figure><p>We can now see that the distribution is neatly divided into two peaks. 
This is the end of the analysis in this demo, but there are many things to explore, such as what contributes to the price distribution besides the room layout.</p><p>In this way, <strong>Jupyter is efficient when you look at data for the first time and don’t yet know its contents, data types, format, or distribution, or when the work that needs to be done only becomes clear while you explore the data.</strong></p><h3>Appendix: Issues with Jupyter and how to solve them</h3><p>Finally, I’d like to list some of my concerns about using Jupyter and how to handle them. The increase of technical debt is a problem not only for Jupyter, but also for machine learning systems, and I think there is still room for improvement.</p><ul><li>Appearance<br>&gt; JupyterLab ships with a dark theme that you can select.</li><li>Code Completion<br>&gt; Use a library for completion (such as <a href="https://github.com/krassowski/jupyterlab-lsp">jupyterlab-lsp</a>)</li><li>Increasing technical debt<br>&gt; Use <a href="https://github.com/mwouts/jupytext">jupytext</a> to generate .py files and version them with git<br>&gt; Move code into .py files as needed and import it as functions<br>&gt; Write documentation<br>&gt; Write test code</li></ul><p>*1: Data collection from websites for data analysis doesn’t violate any laws in Japan unless it involves personal information or puts a heavy load on the servers.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee507aff41cb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Between Machine Learning PoC and Production]]></title>
            <link>https://medium.com/swlh/between-machine-learning-poc-and-production-618502abef86?source=rss-69519ff4d58c------2</link>
            <guid isPermaLink="false">https://medium.com/p/618502abef86</guid>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[airflow]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Ryo Koyajima / 小矢島 諒]]></dc:creator>
            <pubDate>Mon, 01 Feb 2021 17:54:02 GMT</pubDate>
            <atom:updated>2021-02-05T14:15:56.332Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RsYYFRUCvRGWQBb8TmSD6g.png" /><figcaption>the final architecture of this article</figcaption></figure><p><em>The Japanese version is here:</em><br>(<a href="https://qiita.com/koyaaarr/items/259ad4f0d574497c5b08">https://qiita.com/koyaaarr/items/259ad4f0d574497c5b08</a>)</p><h3>Introduction</h3><p>Machine learning Proof of Concept (PoC) projects are very popular these days due to the recent AI boom. If (very fortunately) you get good results in the PoC, you may then want to put the PoC system into production. However, while a lot of knowledge has been shared about exploratory data analysis and building predictive models, there is still not much knowledge on how to put them into practice, especially in production.</p><p>In this article, we will examine what is needed technically during the transition from PoC to production operations. I hope that this article will help you turn your machine learning PoC from a one-off experiment into something that creates value in production.</p><h4><strong>What is written in this article</strong></h4><ul><li>How to proceed with data analysis in a PoC</li><li><strong>How to proceed with the test operation of the machine learning PoC (the main topic of this article)</strong>.</li><li><strong>Architecture in each phase of PoC and test operation (the main topic of this article)</strong>.</li><li>Additional things to consider for production operations</li></ul><p>I will focus especially on test operations. 
During test operations, operations and analysis are often done in parallel, and I will describe an example of how to update the architecture of the system while balancing operations and analysis.</p><h4>What is not written in this article</h4><ul><li>Details on exploratory data analysis</li><li>Details on preprocessing and feature engineering</li><li>Details on building predictive models</li><li>Lower layers than middleware (databases and web servers)</li><li>Consulting skills to handle Machine Learning PoC</li></ul><p>Consulting skills are very important in machine learning projects because of their uncertainty but are not included in this article as the focus is on the technology.</p><h4>Systems assumed in this article</h4><ul><li>Use a relatively small dataset, less than 100 GB</li><li>Handle data that can be stored in memory, rather than data in the hundreds of millions of records</li><li>Batch learning and batch inference</li><li>Not perform online (real-time) learning and inference</li><li>System construction proceeds in parallel with data analysis</li><li>Not have concrete requirements to create in the beginning so we build them as needed while proceeding</li></ul><h4>Data used in this article</h4><p>We will use data from a previous Kaggle competition, “<a href="https://www.kaggle.com/c/home-credit-default-risk/data">Home Credit Default Risk</a>” in this article. This competition uses an individual’s credit information to predict whether or not they will default on their debt. There are records for each loan application in the data, and each record contains information about the applicant’s credit and the label indicating whether the person was able to repay the loan or defaulted on it.</p><p>In this article, we will assume that we are in the data analytics department of a certain loan lending company. 
Under this assumption, we want to use machine learning to automate credit decisions based on this credit information.</p><p>For the sake of explanation, we will divide “application_train.csv” among the data available in this competition as shown in the figure. The split data will be used under the following assumptions.</p><ul><li>initial.csv: Past credit information, to be used in the PoC</li><li>20201001.csv: Credit information for October 2020. In the test operation, this data is handled as training data together with “initial.csv”.</li><li>20201101.csv: Credit information for November 2020. In the test operation, this data is handled as training data together with “initial.csv”.</li><li>20201201.csv: Credit information for December 2020. In the test operation, this data is handled as training data together with “initial.csv”.</li><li>20210101.csv: Credit information for January 2021. In the test operation, we will start forecasting from this month.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PALcBQQnYkE-MeEQwV90xA.png" /></figure><p>The actual code for splitting the data is shown below.</p><p><strong>split_data.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4672e04f6b37c9cbb32648e60a3a7732/href">https://medium.com/media/4672e04f6b37c9cbb32648e60a3a7732/href</a></iframe><h4>Situation to be considered</h4><p>In this article, for ease of explanation, we will assume the following project. The following story is the author’s invention, loosely based on the data of “Home Credit Default Risk”, and has nothing to do with any actual company or business. The author is a complete novice in the field of credit operations, so the story may differ greatly from actual operations.</p><p><em>As a data scientist, I am participating in a project to automate the credit approval process at a loan lending company. 
The credit judgment work is done manually by the screening department, but we are considering whether machine learning can be used to reduce man-hours and improve the accuracy of credit judgment. Sample data has already been provided, and we are in the PoC stage. The sample data is a record of past loan defaults by borrowers. Based on this data, if someone wants to take out a new loan, we would like to be able to predict whether or not that person will default on the loan so that we can decide whether or not to lend the loan.</em></p><h4>Scope of the project in this article</h4><p>A machine learning project usually goes through planning, PoC, test operation, and production operation. In this article, to focus on the technical points, I will describe the scope from PoC to test operation. In particular, I will divide the test operation into three phases, since a lot of functions are required to move to production. Since the author has little experience in production operations, I only mention the points that should be considered for production operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/847/1*bThro90LzCfa-zkKczKFpA.png" /></figure><h4>Structure assumed in this article</h4><p>In this article, I will assume a minimal structure, as we are going to start the project small. Specifically, there is a consultant who will communicate with the business department (the credit judgment department) and a data scientist who will perform everything from data analysis to system construction. In reality, there is a manager as a supervisor, but they will not appear in this article. Also, as a stakeholder, there is a person in the business department.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/666/1*i8mVMCGU1tUgJnWyPV89nA.png" /></figure><h3>PoC phase</h3><h4>Purpose of this phase</h4><p>The purpose of this phase is to verify whether it is feasible to automate credit decisions. 
In this phase, we will examine two main points: one is to validate the data, that is, to check whether the provided data can actually be used in production (e.g. whether every column is available at forecasting time and whether the records are independent of each other), and the other is to determine how accurately the defaults can be predicted by machine learning.</p><h4>Architecture in this phase</h4><p>In this phase, we will work only with JupyterLab. MLflow is included for storing the machine learning models, but (in my opinion) it is not necessary at the beginning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/444/1*D8uq3k9DmFZlWHRJ0mUNSw.png" /></figure><h4>Data validation</h4><p>If you are a data scientist, you want to start looking at the data right away, but first, you need to validate the data. This is because if the data is flawed, any predictions made using the data will likely be useless. Validation includes two main points. The first is to check, for each record, when the data in each column becomes available. All columns appear to be available when the data is handed to us, but that doesn’t mean that they become available at the same time in practice. As the simplest example, the objective variable “whether the debtor has defaulted” will be known later than the other columns. The second point is to check whether there is any relationship between the records. For example, if a person applied for a loan twice, with the first record in the training data and the second record in the test data, information leaks from training to test and the measured accuracy will look better than it really is. In such a case, make sure that both records end up on the same side, either both in the training data or both in the test data. In addition to these points, it is also important to clarify the definition of the data by interviewing the business department about what each column means and what the unit of the record is (e.g. in this data, is it per person or per loan application?). 
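</p><p>The point about related records can be sketched as follows. This is only an illustration under assumptions: the applicant_id field and the split_by_group helper are hypothetical, and the Home Credit data itself may not expose such an identifier. The idea is to assign whole applicants, not single applications, to the training or test side:</p>
<pre>
```python
import random


def split_by_group(records, group_key, test_ratio=0.25, seed=0):
    """Split records so that each group lands entirely in train or test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_ratio))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Hypothetical loan applications; applicant "A" applied twice.
applications = [
    {"applicant_id": "A", "loan_id": 1},
    {"applicant_id": "A", "loan_id": 2},
    {"applicant_id": "B", "loan_id": 3},
    {"applicant_id": "C", "loan_id": 4},
    {"applicant_id": "D", "loan_id": 5},
]

train, test = split_by_group(applications, "applicant_id")
train_ids = {r["applicant_id"] for r in train}
test_ids = {r["applicant_id"] for r in test}
print(sorted(train_ids), sorted(test_ids))
```
</pre>
<p>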
You may use a spreadsheet to check these checkpoints for each column of the data.</p><h4>Exploratory data analysis</h4><p>Once the data has been validated (or in parallel with the validation), we can use <a href="https://github.com/jupyterlab/jupyterlab">Jupyter Lab</a> to see what columns (features) are present by visualizing the sample data. This process will help you understand the data and do feature engineering and model selection. It is also useful to find problems in the data.</p><p>First, for each column, we will check the data type, percentage of missing values, etc.</p><p><strong>eda.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3a30e8fa48960687f41db7ea31b17227/href">https://medium.com/media/3a30e8fa48960687f41db7ea31b17227/href</a></iframe><p>Next, to see the distribution, we will visualize it. If the data type is numeric, we will use a histogram, and if the data type is a string, we will use a bar chart.</p><p><strong>eda.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d79447f5a923d717818031a6bff49270/href">https://medium.com/media/d79447f5a923d717818031a6bff49270/href</a></iframe><p>Two of the output graphs will be shown as examples. In fact, we should look at the distributions one by one, but we will skip that for now.</p><p><strong>AMT_CREDIT</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*efb41Vd0_IwCh7b1-B7C6Q.png" /></figure><p><strong>NAME_INCOME_TYPE</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_eGzsCsdp_lB1JHza8ziMA.png" /></figure><h4>Verification of prediction accuracy</h4><p>From here, we will actually create the model and verify the prediction accuracy. In this case, we will use the AUC of ROC, which is the same evaluation indicator used in “Home Credit Default Risk”. 
In reality, we will discuss with the business department and agree in advance on which indicator to use. Before creating a prediction model manually, we will first try to make a quick prediction using <a href="https://github.com/pycaret/pycaret">PyCaret</a>. This will allow us to compare which features/models are effective and use them as a reference when actually creating the model.</p><p><strong>eda.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7456778d496f89f40c10198278f28d7c/href">https://medium.com/media/7456778d496f89f40c10198278f28d7c/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8957efdfb5d50e3232d8f7af653e2ee4/href">https://medium.com/media/8957efdfb5d50e3232d8f7af653e2ee4/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9VmzoRO4o8J36rxYjhCA1g.png" /></figure><p>In this article, we will compare the following models provided by PyCaret.</p><ul><li>Logistic regression</li><li>Decision Trees</li><li>Random Forest</li><li>SVM</li><li>LightGBM</li></ul><p>LightGBM seems to be superior when the evaluation metric is AUC. In general, LightGBM seems to be better in both accuracy and execution speed in most cases. By the way, recall is small in all models because of the imbalanced data with few positive examples. Depending on your business goals, you may create a model with a high recall score so that you prevent more bad debts. 
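</p><p>One common way to trade false alarms for recall, which is not necessarily what you would do in a real credit project, is to lower the decision threshold applied to the predicted default probability. A toy sketch with made-up scores and labels (the function name is hypothetical):</p>
<pre>
```python
# Made-up predicted default probabilities and true labels (1 = default).
# Only 2 of the 10 applications are defaults, mimicking imbalanced data.
scores = [0.05, 0.10, 0.08, 0.20, 0.15, 0.12, 0.42, 0.07, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]


def recall_and_false_alarms(threshold):
    """Return (recall, number of non-defaults flagged) at a threshold."""
    flagged = [s >= threshold for s in scores]
    caught = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    false_alarms = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    return caught / sum(labels), false_alarms


at_default = recall_and_false_alarms(0.5)
at_lowered = recall_and_false_alarms(0.4)
print(at_default, at_lowered)
```
</pre>
<p>In this toy data, lowering the threshold catches both defaults but also flags one applicant who actually repaid.</p><p>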
In this article, we will not do any more detailed modeling and will use LightGBM to build models.</p><p>Next, we will create and evaluate a LightGBM model in PyCaret to see which features are effective.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9163ca76116d5e99d346505f21df60eb/href">https://medium.com/media/9163ca76116d5e99d346505f21df60eb/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*iKMCGUZGw1BQ6-b4Kclidg.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/aa54155d4410a4c1b439cd2cbfdc3cab/href">https://medium.com/media/aa54155d4410a4c1b439cd2cbfdc3cab/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/227/1*VLj4tLF_GFczabjtP2cR4Q.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fb4ea58c279eb14d75d1d97dadedaef7/href">https://medium.com/media/fb4ea58c279eb14d75d1d97dadedaef7/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/847/1*IjFSZBY5BKB4-b9fB9-6zQ.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e9b3430d1ebb64edd054998a1fdde301/href">https://medium.com/media/e9b3430d1ebb64edd054998a1fdde301/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*pl5eL0u0wFbyOpYOVoTsXA.png" /></figure><p>If there are a lot of features, as in the case of this data, reducing the number of the features will increase both the accuracy and stability of the model. A simple way to do that is to calculate the feature importance and exclude the features with low importance. In this case, we will simply use the features with high importance. 
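</p><p>As a rough sketch of that selection step, assuming the importances have already been computed and collected in a dictionary (the numbers below are invented, and select_top_features is a hypothetical helper):</p>
<pre>
```python
# Invented importance scores for a few Home Credit columns.
importance = {
    "EXT_SOURCE_3": 310,
    "EXT_SOURCE_2": 295,
    "AMT_CREDIT": 180,
    "DAYS_BIRTH": 150,
    "FLAG_MOBIL": 0,
    "FLAG_DOCUMENT_12": 1,
}


def select_top_features(scores, top_n):
    """Keep the top_n feature names, highest importance first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


selected = select_top_features(importance, top_n=4)
print(selected)
```
</pre>
<p>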
For the columns that are automatically preprocessed by PyCaret, we will use the original columns.</p><p>Now, we will create the prediction model manually.</p><h4>Preprocessing</h4><p>For the sake of simplicity, we will only complement the missing values in the preprocessing.</p><p><strong>forecast.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f8ee3555899de8902f083f45c73dd5b4/href">https://medium.com/media/f8ee3555899de8902f083f45c73dd5b4/href</a></iframe><h4>Feature Engineering</h4><p>Feature engineering involves feature selection and creating dummy variables of categorical features.</p><p><strong>forecast.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1fc066cb2e66c48b6b2948495ac6a505/href">https://medium.com/media/1fc066cb2e66c48b6b2948495ac6a505/href</a></iframe><h4>Prediction</h4><p>Use LightGBM to create a model. Also, use <a href="https://github.com/optuna/optuna">Optuna</a> to tune hyperparameters.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4134ef5807c4f42fcfd526da53fee1fb/href">https://medium.com/media/4134ef5807c4f42fcfd526da53fee1fb/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/64476a8da29c74b46c2f21ce2ccb22a4/href">https://medium.com/media/64476a8da29c74b46c2f21ce2ccb22a4/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/397/1*-jZBXtl8Ic19J9cKtAnJvQ.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e6eec40e65d5a47eeb899dea70607748/href">https://medium.com/media/e6eec40e65d5a47eeb899dea70607748/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/397/1*ZDB0O0ECuFJf5E1TNnHxbw.png" /></figure><p>In this verification of the prediction accuracy, we were able to achieve 
almost the same accuracy as with PyCaret. In reality, we would conduct a more in-depth analysis based on these results, but we finish the verification of the PoC phase here.</p><p>From here on, we will assume that the results of the PoC have been reported to the business department and that the project proceeds from PoC toward production. However, the PoC will not suddenly go into production. The PoC system will be gradually brought closer to production through several test operations. Therefore, we will divide the test operation into three phases. In each phase, we will add functions little by little so that the operation will be gradually automated and get closer to the production operation.</p><h4>Supplement: Machine learning model management</h4><p>For managing machine learning models, <a href="https://github.com/mlflow/mlflow">MLflow</a> is useful. It can track each model together with the hyperparameters explored by Optuna, which will be useful as the number of model trials increases.</p><h3>Test Operation</h3><h4>The three phases of test operations</h4><p>Before we can go from PoC to production, we need to implement some features such as automation of operations. However, it would be difficult in terms of man-hours to implement all the necessary functions right away. (Besides, at this stage, you are probably being asked by the business department to further improve the accuracy.) Therefore, we will divide the necessary functions into three phases and implement them gradually, so that we can expand the functions as we operate. In each phase, we will implement the following functions respectively:</p><ol><li>Building data pipelines and semi-automated operations</li><li>Implementation of regular operation API</li><li>Migration to the cloud and automation of operations</li></ol><h3>Test Operation Phase 1: Building data pipeline and semi-automated operations</h3><h4>Purpose of this phase</h4><p>In this phase, we will partially automate the system created in the PoC. 
Before that, we will build a data pipeline by dividing the PoC program into blocks such as feature engineering and prediction. This allows training and inference to be executed in isolation or rerun from an intermediate step. In addition, we introduce Airflow, a workflow engine, so that the blocks can be executed in order, either automatically or on a schedule.</p><h4>Architecture in this phase</h4><p>In the PoC phase, we used a single Jupyter Notebook for everything from preprocessing to prediction. From this phase on, we will introduce two open-source tools to execute multiple notebooks in order. The first is “<a href="https://github.com/nteract/papermill">papermill</a>”, which runs Jupyter Notebooks from the command line with parameters, so that we can make predictions for different months without rewriting the notebooks. The second is “<a href="https://airflow.apache.org/">Airflow</a>”, which runs the notebooks in order. It provides not only automatic execution but also scheduled execution, success and failure notifications, and other functions that are useful for operational automation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/723/1*NNDUuve6Ol4VkDNLQD2QxA.png" /></figure><h4>Data pipeline</h4><p>Divide the program created in the PoC into four blocks: “data accumulation”, “feature engineering”, “learning”, and “inference”. The blocks should be loosely coupled, with data serving as the interface between them. This limits the impact of changes in the program logic of any one block. For reference, here is an image of the data pipeline in this article. Each block receives the execution month as a parameter from papermill at the beginning of the program, so that it can be run for a specific month.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/747/1*UizJVLTBV3mZb1UwD-CEQA.png" /></figure><p>The following is the code for each block. 
It is essentially the program from the PoC, reused with some additions and modifications for operational automation.</p><p><strong>Data accumulation</strong></p><p><strong>accumulate.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a299dc63ffb80d9b7b934178ae46285a/href">https://medium.com/media/a299dc63ffb80d9b7b934178ae46285a/href</a></iframe><p><strong>Feature engineering</strong></p><p><strong>feature_engineering.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e2d6b57c4657ae346cc03e6704e60cf8/href">https://medium.com/media/e2d6b57c4657ae346cc03e6704e60cf8/href</a></iframe><p><strong>Learning</strong></p><p><strong>learn.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e37e0ba25a9b1fe9532c54d08f58c0e8/href">https://medium.com/media/e37e0ba25a9b1fe9532c54d08f58c0e8/href</a></iframe><p><strong>Inference</strong></p><p><strong>inference.ipynb</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/13cfbdf8a3acec4fc30c4993cb58bbb9/href">https://medium.com/media/13cfbdf8a3acec4fc30c4993cb58bbb9/href</a></iframe><h4>Semi-automating operations</h4><p>Once each process has been split into an individual program, Airflow can be used to execute them in order. By passing the forecast month as a parameter at runtime, we can run the workflow for any given month. To schedule the execution, define the date and time as a cron expression in “schedule_interval”. 
The Airflow code is shown below.</p><p><strong>trial_operation.py</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a1b231f1fafa1f8fcead4b8706fca060/href">https://medium.com/media/a1b231f1fafa1f8fcead4b8706fca060/href</a></iframe><p>You can view the workflow you defined as a flowchart in Airflow. For example, the above code is visualized as the following figure. You can see that this diagram has the same structure as the data pipeline we defined earlier. (In the figure, each box is green because the blocks have already completed successfully.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/809/1*fb4TtnTdsa2BhjPKGrrvBg.png" /></figure><p>With the implementation of test operation phase 1, the monthly operations are reduced as shown below. Most of the steps are now automated.</p><ul><li>PoC Phase</li></ul><ol><li>Upload the data for the forecast month</li><li>Combine the training data of previous months</li><li>Preprocess the training data and do feature engineering</li><li>Train the model on the training data</li><li>Preprocess the test data and do feature engineering</li><li>Predict on the test data using the trained model</li><li>Download the prediction result</li></ol><ul><li>Test Operation Phase 1</li></ul><ol><li>Upload the data for the forecast month</li><li>Run the workflow from Airflow</li><li>Download the prediction result</li></ol><h3>Test Operation Phase 2: Implementation of regular operation API</h3><h4>Purpose of this phase</h4><p>In phase 1, we were able to automate much of the monthly operation by dividing functions such as preprocessing and inference into separate programs and executing them in order with papermill and Airflow. In phase 2, we will automate the process further. Specifically, we will prepare APIs and a GUI screen for uploading and downloading data and for running regular operations, all of which were done manually in phase 1. 
This lets even non-engineering users, such as consultants and business departments, operate the system easily. Regular operations can then be left to the users, and the engineers can concentrate on development tasks.</p><h4>Architecture in this phase</h4><p>In phase 2, we will build a web server and create a GUI screen to operate it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YWwiR-2IQ8GOYDeBTVAK5w.png" /></figure><h4>Creating a web server</h4><p>Prepare the following APIs for the web server.</p><ul><li>Upload function for input files</li><li>Execution of regular operations</li><li>Download function for forecast files</li></ul><p>This time, we will use <a href="https://github.com/tiangolo/fastapi">FastAPI</a> to create the web server.</p><p><strong>server.py</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d004817103635170bfb2dd5a70e60c42/href">https://medium.com/media/d004817103635170bfb2dd5a70e60c42/href</a></iframe><h4>Creating the GUI screen</h4><p>For the GUI, we need a button to call the web server API and a form to upload data. 
In this case, I used <a href="https://github.com/facebook/react">React</a> and <a href="https://github.com/microsoft/TypeScript">TypeScript</a> to build the GUI myself, but it may be faster to use a library for building GUIs, such as <a href="https://github.com/streamlit/streamlit">Streamlit</a>.</p><p><strong>App.tsx</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fb399aeb798b3f4f65ac182f5d0b813e/href">https://medium.com/media/fb399aeb798b3f4f65ac182f5d0b813e/href</a></iframe><p>The GUI screen looks like the following image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*je_O2f3OSu6vTYVtn-yQ8g.png" /></figure><h3>Test Operation Phase 3: Migration to the cloud and automation of operations</h3><h4>Purpose of this phase</h4><p>In phase 3, we will move the servers to the cloud and move some functions to managed services to further automate regular operations. The purpose of using the cloud is to increase the availability of the system by delegating work such as infrastructure operations to the cloud provider, so that we can focus more on enhancing and maintaining the application. The basic functions are common to all major clouds, such as AWS, GCP, and Azure, but each has its own features and characteristics, so it is worth comparing them.</p><p>In this article, I will briefly examine migration to AWS as an example. There are two migration examples: Pattern 1, in which the system created in test operation phase 2 is simply migrated to AWS, and Pattern 2, in which further automation is performed.</p><h4>Architecture Pattern 1 with AWS: Simple EC2-only configuration</h4><p>Each server is built on EC2, and data is stored in EBS. Usage is almost the same as with local Linux machines, so migration should not be difficult. However, uploading input data and downloading prediction results still have to be done manually. 
Also, since each function of the system simply runs on EC2, the system is not much easier to enhance or maintain than before.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/985/1*GCIQ_HMfZPwOuQaB_dITQw.png" /></figure><h4>Architecture Pattern 2 on AWS: Further automated configuration</h4><p>Pattern 2 addresses the following issues that remained in pattern 1.</p><ul><li>Automation of data input/output</li><li>Splitting some functions into individual programs and managed services</li></ul><p>To automate data input/output, we use S3 as a shared folder for exchanging data with external systems. We can monitor data input/output on S3 using CloudWatch and CloudTrail, and call Airflow’s regular operation API from Lambda. In this way, storing an input file triggers a run of the prediction system. With this setup, there is no need for a GUI or a web server. A web server in the cloud would require authentication functions and vulnerability countermeasures, so removing it also reduces those risks.</p><p>As for splitting some of the functions into individual programs and services, we made the following changes.</p><ul><li>Changed the storage location of input/output files to S3</li><li>Moved the trigger program for system execution to Lambda</li><li>Migrated the success/failure notification program to Lambda and SNS</li></ul><p>The scope of what we split off this time is not very wide, but I think the program can be divided further by using other AWS services, which would make it easier to enhance and maintain. However, if you expand the scope too much, you may end up with vendor lock-in, so you need to consider the ease of migration as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RsYYFRUCvRGWQBb8TmSD6g.png" /></figure><p>We have now completed all considerations up to test operation phase 3. 
In practice, there are many technical and business hurdles in turning an on-demand PoC analysis into a service that runs regularly in production, but I hope that the methods discussed here will be helpful.</p><h3>Additional things to consider for production operations</h3><p>Finally, in this section I will list things to consider for production operations.</p><h4>Utilizing the cloud</h4><p>In test operation phase 3, we migrated to the cloud. Since the cloud offers a variety of functions, it is best to use them as long as they do not significantly sacrifice portability. For example, data governance can be introduced by linking internal authentication with the cloud’s authentication function, and auto-scaling can be used to handle larger-scale data.</p><p>It is also important to eliminate as much of your own code as possible and move to managed services. Your own code is not easy to maintain and tends to depend on the individuals who wrote it, so for the long-term operation of the system you should consider using a service that offers functions similar to your program. For example, for Airflow, there are managed services such as GCP’s “Cloud Composer” and AWS’s “Amazon Managed Workflows for Apache Airflow”, so using these services is something to consider.</p><h4>Program reusability</h4><p>While Jupyter Notebook is easy and convenient for development, notebooks are not easy to manage with git, run, or test. It may be a good idea to migrate to Python files as needed, depending on the balance you want between development speed and quality. Also, if the system itself can be built on Docker and Kubernetes, that will not only make the system more robust and easier to scale, but will also bring business benefits, such as making it easier to roll out to other projects.</p><h4>Data storing</h4><p>In this article, data was stored in CSV or Pickle format, but it is worth considering which data should be stored in which format. 
For this purpose, it is useful to manage the definition of each dataset in a spreadsheet while the data pipeline is being developed. I often use CSV for data that is difficult to recreate (input data) or that is exchanged externally (forecast results), and Pickle for intermediate data. Pickle is convenient, but it is neither versatile nor robust, so it is better to store data in CSV format and define the data types separately, or to use the “Parquet” format if you are familiar with it.</p><h4>Data monitoring</h4><p>To operate a machine learning system continuously, you need to pay attention to the data as well as the system. For example, if the trend of the input data changes, it may have a significant impact on the prediction accuracy even if there is no problem with the system. Therefore, it is necessary to monitor the input data, for example, by checking whether the distribution of each column and its relationship with the labels have changed. Also, depending on the system you are creating, you need to verify the fairness of the predictions, for example, whether the prediction results vary depending on gender.</p><h4>Data governance</h4><p>At the PoC level, access to data may be naturally limited, but as the system is operated for longer and the number of people involved increases, it becomes necessary to set appropriate access privileges for each dataset. In such cases, it is best to use the authentication functions of cloud services. For example, by creating individual accounts with AWS IAM, you can flexibly set access privileges to the data stored in S3 according to each individual’s department or position. 
Also, since cloud services can be integrated with internal authentication infrastructure, it is a good idea to use that integration.</p><h3>Software and code used in this article</h3><p>The source code of the system built as an example in this article is stored in the following GitHub repository.</p><p><a href="https://github.com/koyaaarr/between_poc_and_production">https://github.com/koyaaarr/between_poc_and_production</a></p><p>The versions of the main software used are as follows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/644/1*m9GiGZT6kZ3ydb8cJBdW9g.png" /></figure><h3>Reference</h3><ul><li>Beyond Interactive: Notebook Innovation at Netflix (<a href="https://netflixtechblog.com/notebook-innovation-591ee3221233">https://netflixtechblog.com/notebook-innovation-591ee3221233</a>)</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=618502abef86" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/between-machine-learning-poc-and-production-618502abef86">Between Machine Learning PoC and Production</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>