<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://bobrinik.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://bobrinik.github.io/" rel="alternate" type="text/html" /><updated>2025-12-29T01:56:13+00:00</updated><id>https://bobrinik.github.io/feed.xml</id><title type="html">Maksim Bober</title><subtitle>Personal website and blog</subtitle><entry><title type="html">Introducing Poker Gym</title><link href="https://bobrinik.github.io/2025/12/28/introducing-poker-gym.html" rel="alternate" type="text/html" title="Introducing Poker Gym" /><published>2025-12-28T00:00:00+00:00</published><updated>2025-12-28T00:00:00+00:00</updated><id>https://bobrinik.github.io/2025/12/28/introducing-poker-gym</id><content type="html" xml:base="https://bobrinik.github.io/2025/12/28/introducing-poker-gym.html"><![CDATA[<p>I started learning Poker seriously and as part of this effort generated a simple app to help me do this. I’m also trying to learn how Telegram web-apps are working, so I created an app that will be available on Telegram that helps you remember poker combinations. 
I’m sharing it with the world to do a basic idea validation and at the same time collect some feedback.</p>

<h2 id="what-is-poker-gym">What is Poker Gym?</h2>
<p>Poker Gym is a free app available through Telegram where users can practice basic poker skills. For now, it only supports poker combinations; more drills are planned.</p>

<h2 id="the-problem">The Problem</h2>
<p>Learning the basics of poker is hard because you usually pick them up through trial and error. What if you don’t want to play a couple of hundred games to get good, and would rather drill the basics in an app?</p>

<h2 id="the-solution">The Solution</h2>
<p>Poker Gym lets you practice basics such as learning combinations and rankings.</p>

<h2 id="features-planned">Features (Planned)</h2>
<ul>
  <li>Outs: Get familiar with counting outs and drill it (outs are unseen cards that combine well with your cards)</li>
  <li>Progress tracking: See your improvement over time</li>
  <li>Leaderboard: See your place in relation to other players</li>
</ul>

<h2 id="looking-for-feedback">Looking for Feedback</h2>

<p>If this sounds interesting to you, I’d love to hear your thoughts:</p>
<ul>
  <li>Would you use a tool like this?</li>
  <li>What specific scenarios would you want to practice?</li>
  <li>What features would make this valuable for you?</li>
</ul>

<p>Feel free to reach out with feedback at https://forms.gle/E2UMLnJqQxX1JnTx9</p>]]></content><author><name></name></author><category term="Poker" /><category term="Idea Validation" /><summary type="html"><![CDATA[I started learning Poker seriously and as part of this effort generated a simple app to help me do this. I’m also trying to learn how Telegram web-apps are working, so I created an app that will be available on Telegram that helps you remember poker combinations. I’m sharing it with the world to do a basic idea validation and at the same time collect some feedback.]]></summary></entry><entry><title type="html">Scaling Databases: The Modulo Hashing Problem Visualized</title><link href="https://bobrinik.github.io/2025/11/29/modulo-hashing-visualizer.html" rel="alternate" type="text/html" title="Scaling Databases: The Modulo Hashing Problem Visualized" /><published>2025-11-29T00:00:00+00:00</published><updated>2025-11-29T00:00:00+00:00</updated><id>https://bobrinik.github.io/2025/11/29/modulo-hashing-visualizer</id><content type="html" xml:base="https://bobrinik.github.io/2025/11/29/modulo-hashing-visualizer.html"><![CDATA[<h2 id="problem">Problem</h2>
<p>The current database cannot handle the volume of incoming write requests. During peak times, there are too many write/update requests incoming per second, so requests are taking longer to execute. You can buffer requests, but if the database cannot fulfill them faster than they arrive, it will overflow. Let’s say you work in a bank and cannot afford to drop any requests.</p>

<h2 id="solution">Solution</h2>
<p>To scale writes, you can either use a bigger database (scale vertically) or use multiple databases (scale horizontally). Let’s say you want to scale horizontally. So you add an extra database. Now, you need to figure out how to forward requests to multiple DBs.</p>

<p>You can do it in a round-robin style. The issue is that records for user X end up spread across different DBs, which makes querying all records for X slower (we need to query all DBs to get results) and makes enforcing table constraints more difficult (cross-database referential integrity has to be handled outside the database engine). Since round-robin costs us referential integrity, we need to route requests so that user X always goes to the same database, say Database 1. Then all of user X’s data is located on Database 1, and the database engine can perform referential-integrity checks for that user.</p>

<p>One way to achieve this is to use the modulo operator. We can take the modulo of <code class="language-plaintext highlighter-rouge">user_id</code> and use the result to determine which database to map our user to. Here’s an example of how it can be done. We can take our ID, convert it to an integer, and perform a modulo operation on it.</p>

<p><img src="/assets/images/2025-11-29-modulo-hashing/image.png" alt="" /></p>
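<p>The routing rule can be sketched in a few lines of Python. This is a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">user_id</code> is a UUID string; the function name <code class="language-plaintext highlighter-rouge">shard_for</code> and the sample ID are made up for illustration.</p>

```python
import uuid


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a UUID string to a shard index between 0 and num_shards - 1."""
    # uuid.UUID(...).int exposes the UUID as a 128-bit integer
    return uuid.UUID(user_id).int % num_shards


# Hypothetical user ID, used only for illustration
user = "123e4567-e89b-12d3-a456-426614174000"

# The same user always lands on the same shard while num_shards stays fixed
print(shard_for(user, 5))
print(shard_for(user, 5))
```

<p>Because the result depends only on the ID and the shard count, every request for the same user is routed to the same database as long as the shard count does not change.</p>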

<p>It works nicely as long as your IDs are evenly distributed. If they are, each database receives roughly the same number of users. Let’s check if our UUIDs are evenly distributed.</p>

<p><img src="/assets/images/2025-11-29-modulo-hashing/uuid_even_distribution.png" alt="" /></p>

<p>They are pretty much evenly distributed. There seems to be some noise around the 2nd bucket, but it will smooth out as the number of IDs increases.</p>
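<p>A check like the one in the chart can be reproduced roughly as follows. This is a sketch, assuming random version-4 UUIDs and 5 shards; the sample size and the helper name <code class="language-plaintext highlighter-rouge">bucket_counts</code> are arbitrary choices, not the code behind the chart.</p>

```python
import uuid
from collections import Counter


def bucket_counts(num_ids: int, num_shards: int) -> Counter:
    """Count how many random UUIDs land in each shard under modulo routing."""
    return Counter(uuid.uuid4().int % num_shards for _ in range(num_ids))


counts = bucket_counts(100_000, 5)
for shard in sorted(counts):
    # Each shard should hold close to 20% of the IDs
    print(shard, counts[shard])
```

<p>The exact counts differ from run to run because the IDs are random, but with a large sample each bucket should sit close to an equal share.</p>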

<h2 id="now-whats-the-problem-with-modulo-hashing">Now what’s the problem with modulo hashing?</h2>
<p>Problems with this approach start when we want to rescale our database. After adding a database, instead of doing <code class="language-plaintext highlighter-rouge">modulo 5</code> we do <code class="language-plaintext highlighter-rouge">modulo 6</code>, so a record that was mapped to Database 1 may now be mapped to Database 6, and the same kind of move has to happen for many records.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5 % 5 -&gt; 0
6 % 5 -&gt; 1
7 % 5 -&gt; 2

5 % 6 -&gt; 5
6 % 6 -&gt; 0
7 % 6 -&gt; 1
</code></pre></div></div>
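<p>To put a number on the reshuffle, here is a small Python sketch that counts how many keys change shards when the shard count changes. It assumes sequential integer keys starting at 0, the same simplification the interactive widget on this page uses; <code class="language-plaintext highlighter-rouge">moved_fraction</code> is a made-up helper name.</p>

```python
def moved_fraction(num_keys: int, shards_before: int, shards_after: int) -> float:
    """Fraction of keys 0..num_keys-1 whose shard changes after rescaling."""
    moved = sum(
        1
        for key in range(num_keys)
        if key % shards_before != key % shards_after
    )
    return moved / num_keys


# 20 keys, going from 4 shards to 5: only keys 0..3 stay where they were
print(moved_fraction(20, 4, 5))  # prints 0.8
```

<p>Only the keys smaller than the old shard count keep their place in this example, so 16 of 20 records have to be copied to a different database.</p>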

<div id="modulo-hashing-root"></div>

<script src="https://unpkg.com/react@18/umd/react.production.min.js"></script>

<script src="https://unpkg.com/react-dom@18/umd/react-dom.production.min.js"></script>

<script src="https://unpkg.com/@babel/standalone/babel.min.js"></script>

<script src="https://cdn.tailwindcss.com"></script>

<script type="text/babel">
const { useState, useMemo } = React;

// Simple icon components
const Database = ({ className }) => (
  <svg className={className} fill="none" viewBox="0 0 24 24" stroke="currentColor" strokeWidth={2}>
    <ellipse cx="12" cy="5" rx="9" ry="3" />
    <path d="M21 12c0 1.66-4 3-9 3s-9-1.34-9-3" />
    <path d="M3 5v14c0 1.66 4 3 9 3s9-1.34 9-3V5" />
  </svg>
);

const ArrowRight = ({ className }) => (
  <svg className={className} fill="none" viewBox="0 0 24 24" stroke="currentColor" strokeWidth={2}>
    <path strokeLinecap="round" strokeLinejoin="round" d="M5 12h14M12 5l7 7-7 7" />
  </svg>
);

const AlertTriangle = ({ className }) => (
  <svg className={className} fill="none" viewBox="0 0 24 24" stroke="currentColor" strokeWidth={2}>
    <path strokeLinecap="round" strokeLinejoin="round" d="M12 9v2m0 4h.01m-6.938 4h13.856c1.54 0 2.502-1.667 1.732-3L13.732 4c-.77-1.333-2.694-1.333-3.464 0L3.34 16c-.77 1.333.192 3 1.732 3z" />
  </svg>
);

const RefreshCcw = ({ className }) => (
  <svg className={className} fill="none" viewBox="0 0 24 24" stroke="currentColor" strokeWidth={2}>
    <path strokeLinecap="round" strokeLinejoin="round" d="M4 4v5h.582m15.356 2A8.001 8.001 0 004.582 9m0 0H9m11 11v-5h-.581m0 0a8.003 8.003 0 01-15.357-2m15.357 2H15" />
  </svg>
);

const ModuloHashingVisualizer = () => {
  const [numKeys, setNumKeys] = useState(20);
  const [shardsBefore, setShardsBefore] = useState(4);
  const [shardsAfter, setShardsAfter] = useState(5);

  const getShardColor = (index) => {
    const colors = [
      'bg-blue-900/50 border-blue-500 text-blue-300',
      'bg-green-900/50 border-green-500 text-green-300',
      'bg-purple-900/50 border-purple-500 text-purple-300',
      'bg-orange-900/50 border-orange-500 text-orange-300',
      'bg-pink-900/50 border-pink-500 text-pink-300',
      'bg-teal-900/50 border-teal-500 text-teal-300',
      'bg-yellow-900/50 border-yellow-500 text-yellow-300',
      'bg-indigo-900/50 border-indigo-500 text-indigo-300',
      'bg-red-900/50 border-red-500 text-red-300',
      'bg-gray-700/50 border-gray-500 text-gray-300',
    ];
    return colors[index % colors.length];
  };

  const data = useMemo(() => {
    let movedCount = 0;
    const records = [];

    for (let i = 0; i < numKeys; i++) {
      const prevShard = i % shardsBefore;
      const newShard = i % shardsAfter;
      const hasMoved = prevShard !== newShard;
      
      if (hasMoved) movedCount++;

      records.push({
        id: i,
        prevShard,
        newShard,
        hasMoved
      });
    }

    return { records, movedCount };
  }, [numKeys, shardsBefore, shardsAfter]);

  const percentMoved = ((data.movedCount / numKeys) * 100).toFixed(1);

  return (
    <div className="p-6 max-w-4xl mx-auto bg-[#191919] rounded-xl border border-[#333] font-sans">
      <div className="mb-6">
        <h2 className="text-2xl font-bold text-[#e0e0e0] flex items-center gap-2">
          <Database className="w-6 h-6 text-[#7cb3f3]" />
          The Modulo Hashing Problem
        </h2>
        <p className="text-[#9a9a9a] mt-2">
          Visualize why simple <code className="bg-[#252525] px-1 rounded text-[#e06c75]">hash(key) % N</code> fails when scaling.
          When the number of shards (N) changes, the result of the modulo operation changes for most keys.
        </p>
      </div>

      {/* Controls */}
      <div className="grid grid-cols-1 md:grid-cols-3 gap-6 bg-[#252525] p-4 rounded-lg border border-[#333] mb-6">
        <div>
          <label className="block text-sm font-semibold text-[#e0e0e0] mb-2">Total Records (Keys)</label>
          <input 
            type="range" min="10" max="100" 
            value={numKeys} 
            onChange={(e) => setNumKeys(parseInt(e.target.value))}
            className="w-full accent-[#7cb3f3]"
          />
          <div className="text-right text-[#9a9a9a] font-mono">{numKeys} Keys</div>
        </div>

        <div>
          <label className="block text-sm font-semibold text-[#e0e0e0] mb-2">Current Shard Count (N)</label>
          <div className="flex items-center gap-2">
            <button 
              onClick={() => setShardsBefore(Math.max(1, shardsBefore - 1))}
              className="px-3 py-1 bg-[#2f2f2f] hover:bg-[#3a3a3a] text-[#e0e0e0] rounded border border-[#333]"
            >-</button>
            <span className="font-mono text-lg w-8 text-center text-[#e0e0e0]">{shardsBefore}</span>
            <button 
              onClick={() => setShardsBefore(shardsBefore + 1)}
              className="px-3 py-1 bg-[#2f2f2f] hover:bg-[#3a3a3a] text-[#e0e0e0] rounded border border-[#333]"
            >+</button>
          </div>
        </div>

        <div>
          <label className="block text-sm font-semibold text-[#e0e0e0] mb-2">New Shard Count (N+1)</label>
          <div className="flex items-center gap-2">
            <button 
              onClick={() => setShardsAfter(Math.max(1, shardsAfter - 1))}
              className="px-3 py-1 bg-[#2f2f2f] hover:bg-[#3a3a3a] text-[#e0e0e0] rounded border border-[#333]"
            >-</button>
            <span className="font-mono text-lg w-8 text-center text-[#e0e0e0]">{shardsAfter}</span>
            <button 
              onClick={() => setShardsAfter(shardsAfter + 1)}
              className="px-3 py-1 bg-[#2f2f2f] hover:bg-[#3a3a3a] text-[#e0e0e0] rounded border border-[#333]"
            >+</button>
          </div>
        </div>
      </div>

      {/* Impact Stats */}
      <div className={`p-4 rounded-lg border mb-6 flex items-center justify-between transition-colors duration-300 ${parseInt(percentMoved) > 30 ? 'bg-red-900/20 border-red-800' : 'bg-green-900/20 border-green-800'}`}>
        <div className="flex items-center gap-3">
          {parseInt(percentMoved) > 30 ? <AlertTriangle className="w-6 h-6 text-red-400" /> : <RefreshCcw className="w-6 h-6 text-green-400" />}
          <div>
            <div className="font-bold text-[#e0e0e0]">Reshuffle Impact</div>
            <div className="text-sm text-[#9a9a9a]">Keys that must be moved to a new server</div>
          </div>
        </div>
        <div className="text-right">
          <div className={`text-3xl font-bold ${parseInt(percentMoved) > 30 ? 'text-red-400' : 'text-green-400'}`}>
            {percentMoved}%
          </div>
          <div className="text-sm text-[#9a9a9a]">{data.movedCount} of {numKeys} records</div>
        </div>
      </div>

      {/* Visualization Grid */}
      <div className="bg-[#252525] p-4 rounded-lg border border-[#333]">
        <h3 className="text-sm font-semibold text-[#9a9a9a] uppercase tracking-wide mb-4">Record Mapping Visualization</h3>
        
        <div className="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-3">
          {data.records.map((record) => (
            <div 
              key={record.id}
              className={`relative p-3 rounded-lg border-2 transition-all duration-500 ${
                record.hasMoved 
                  ? 'border-red-500 bg-red-900/30 opacity-100' 
                  : 'border-[#333] bg-[#1e1e1e] opacity-60'
              }`}
            >
              <div className="flex justify-between items-center mb-2">
                <span className="text-xs font-bold text-[#6b6b6b]">KEY {record.id}</span>
                {record.hasMoved && (
                  <span className="text-[10px] font-bold bg-red-900/50 text-red-400 px-1.5 py-0.5 rounded">MOVED</span>
                )}
              </div>

              <div className="flex items-center gap-2">
                <div className={`flex-1 text-center py-1 rounded text-xs font-mono border ${getShardColor(record.prevShard)}`}>
                  S{record.prevShard}
                </div>
                <ArrowRight className={`w-4 h-4 ${record.hasMoved ? 'text-red-400' : 'text-[#6b6b6b]'}`} />
                <div className={`flex-1 text-center py-1 rounded text-xs font-mono border ${getShardColor(record.newShard)}`}>
                  S{record.newShard}
                </div>
              </div>
            </div>
          ))}
        </div>
      </div>


    </div>
  );
};

const root = ReactDOM.createRoot(document.getElementById('modulo-hashing-root'));
root.render(<ModuloHashingVisualizer />);
</script>

<h2 id="conclusion">Conclusion</h2>
<p>Modulo hashing is simple and works well for static systems, but it becomes problematic when you need to scale. For some cases the number of records that need moving can go up as high as 93%. There are different ways of solving it. One way is to use consistent-hashing or lookup table.</p>]]></content><author><name></name></author><category term="Deep Dive" /><category term="Tutorial" /><category term="System Design" /><summary type="html"><![CDATA[Problem The current database cannot handle the volume of incoming write requests. During peak times, there are too many write/update requests incoming per second, so requests are taking longer to execute. You can buffer requests, but if the database cannot fulfill them faster than they arrive, it will overflow. Let’s say you work in a bank and cannot afford to drop any requests.]]></summary></entry><entry><title type="html">Gold Forecast</title><link href="https://bobrinik.github.io/2025/05/27/gold-forecast.html" rel="alternate" type="text/html" title="Gold Forecast" /><published>2025-05-27T00:00:00+00:00</published><updated>2025-05-27T00:00:00+00:00</updated><id>https://bobrinik.github.io/2025/05/27/gold-forecast</id><content type="html" xml:base="https://bobrinik.github.io/2025/05/27/gold-forecast.html"><![CDATA[<p><img src="/assets/images/2025-05-27-gold-forecast/image.png" alt="Gold and Inflation" /></p>

<p>We can see that the gold movement roughly follows inflation. However, it also follows gold buying done by other countries.</p>

<p><img src="/assets/images/2025-05-27-gold-forecast/image 1.png" alt="China Gold Reserves" /></p>

<p>For example, China has been increasing its gold reserves. It’s widely assumed that China does it to reduce risk of dependence on US dollar. It would decrease dependence on US dollar.</p>]]></content><author><name></name></author><category term="Forecast" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Launching Option Calculator</title><link href="https://bobrinik.github.io/2025/01/13/launching-option-calculator.html" rel="alternate" type="text/html" title="Launching Option Calculator" /><published>2025-01-13T00:00:00+00:00</published><updated>2025-01-13T00:00:00+00:00</updated><id>https://bobrinik.github.io/2025/01/13/launching-option-calculator</id><content type="html" xml:base="https://bobrinik.github.io/2025/01/13/launching-option-calculator.html"><![CDATA[<p>Built <a href="https://calculatemyoptions.click/">calculatemyoptions.click</a> website. It’s entirely hosted on AWS. It’s a SPA where static files are on S3 bucket and are served with CloudFront. Backend is R code hosted on Lambda. All of the infra is created/updated with AWS CDK.</p>

<p>In order to release this project, I had to figure out how to host R inside of a container and serve it with AWS Lambda. I’ve already done something similar in <a href="/2024/07/26/running-r-on-aws-lambda.html">Running R on AWS Lambda</a>, so I could reuse what I learned there and build on top of it.</p>

<h2 id="challenges">Challenges</h2>

<p>There were a couple of challenges that I encountered when working on this project:</p>

<h3 id="r-libraries-and-docker-image-size">R Libraries and Docker Image Size</h3>

<p>Not all R libraries were available for the AWS Lambda image, so I had to compile a couple of them from source. Compiling created too many intermediate artifacts, which pushed the final image over 10GB (Docker images hosted on AWS Lambda have a limit of 10GB [1]).</p>

<p>I reduced the size of the Lambda container by using a multi-stage Docker build and copying only the compiled binaries into the final AWS Lambda image. I was able to go from 11GB to around 4GB, and I could run the R container with all libs on AWS Lambda (yay).</p>

<h3 id="frontend-development">Frontend Development</h3>

<p>The second challenge was the frontend, since I had never done frontend work before. Luckily, ChatGPT helped me set up a React template that I could then modify and shape.</p>

<p>Also, CloudFront was a bit tricky to configure, specifically routing requests to the Lambda function and making sure that the SPA could talk to Lambda and work across Firefox and Chrome.</p>

<h2 id="testing-and-release">Testing and Release</h2>

<p>After parts of the whole project had been configured, I did a couple of runs of integration testing and fixing. Once I checked that the skeleton and parts work together, I did a mini release on LinkedIn, to see what people say and if I can catch any errors with real traffic.</p>

<h2 id="takeaways">Takeaways</h2>

<p>Overall, it was a fun learning experience, and now I have deployment templates that I can leverage for future projects as well as knowledge about how website hosting on AWS is done.</p>

<h3 id="references">References</h3>

<ol>
  <li><a href="https://docs.aws.amazon.com/lambda/latest/dg/images-create.html">AWS Lambda container image size limits</a></li>
</ol>]]></content><author><name></name></author><category term="React" /><category term="R" /><category term="AWS" /><category term="2025-resolution" /><category term="january" /><summary type="html"><![CDATA[Built calculatemyoptions.click website. It’s entirely hosted on AWS. It’s a SPA where static files are on S3 bucket and are served with CloudFront. Backend is R code hosted on Lambda. All of the infra is created/updated with AWS CDK.]]></summary></entry><entry><title type="html">Running R on AWS Lambda</title><link href="https://bobrinik.github.io/2024/07/26/running-r-on-aws-lambda.html" rel="alternate" type="text/html" title="Running R on AWS Lambda" /><published>2024-07-26T00:00:00+00:00</published><updated>2024-07-26T00:00:00+00:00</updated><id>https://bobrinik.github.io/2024/07/26/running-r-on-aws-lambda</id><content type="html" xml:base="https://bobrinik.github.io/2024/07/26/running-r-on-aws-lambda.html"><![CDATA[<h2 id="whats-aws-lambda">What’s AWS Lambda?</h2>

<p>It’s a compute environment managed by AWS. You can think about it as a service that has a <code class="language-plaintext highlighter-rouge">while true</code> loop that waits for incoming requests. When a request comes in, Lambda calls your code and passes the request to the appropriate function.</p>

<p><img src="/assets/images/2024-07-26-running-r-on-aws-lambda/Untitled.png" alt="Lambda Architecture" /></p>
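<p>The <code class="language-plaintext highlighter-rouge">while true</code> mental model can be written down as a toy Python loop. This is only an illustration of the idea, not how AWS implements Lambda; <code class="language-plaintext highlighter-rouge">handler</code>, <code class="language-plaintext highlighter-rouge">serve</code> and the in-memory queue are all made up for the sketch.</p>

```python
import json
from queue import Empty, Queue


def handler(event: dict) -> dict:
    """Hypothetical user function; stands in for the code you deploy."""
    return {"statusCode": 200, "body": json.dumps({"echo": event})}


def serve(requests: Queue) -> list:
    """Toy service loop: take a request, call the handler, keep the response."""
    responses = []
    while True:
        try:
            event = requests.get_nowait()
        except Empty:
            # A real service would keep waiting; the toy loop stops when idle
            break
        responses.append(handler(event))
    return responses


pending = Queue()
pending.put({"name": "first"})
pending.put({"name": "second"})
for response in serve(pending):
    print(response["statusCode"], response["body"])
```

<p>The part you own is the handler; the loop around it, and the machines it runs on, are what the managed service takes care of.</p>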

<h3 id="how-do-i-upload-my-r-code-to-lambda">How do I upload my R code to Lambda?</h3>

<p>Ok, not so fast. We cannot upload R code to Lambda directly, because Lambda does not provide an R runtime. Here’s <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html">the list of supported runtimes</a>. There’s a way to patch it, but you will keep running into issues when installing deps, and you would need to do your own maintenance from time to time. We don’t want that.</p>

<p>That’s why we are going to use a 🐳 Docker container to host the R environment. When a request comes to Lambda, it will pass it to the running container.</p>

<p><img src="/assets/images/2024-07-26-running-r-on-aws-lambda/Untitled 1.png" alt="Docker Lambda Architecture" /></p>

<p>Lambda would pull an image from AWS ECR (host for docker images) and then run that image when the request comes in.</p>

<h3 id="so-whats-the-plan">So what’s the plan?</h3>

<ol>
  <li>Create a Docker image that has our R script and all the deps that it needs</li>
  <li>Set up a Docker image registry to upload your images to</li>
  <li>Configure Lambda to use it (to continue, check the code in the repo)</li>
</ol>

<p>Check out the example repo: <a href="https://github.com/Bobrinik/r_on_lambda_example">https://github.com/Bobrinik/r_on_lambda_example</a></p>

<h3 id="trigger-your-lambda-from-console">Trigger your lambda from console</h3>

<p><img src="/assets/images/2024-07-26-running-r-on-aws-lambda/Untitled 2.png" alt="Lambda Console" /></p>

<ul>
  <li>31 seconds of startup time (the initial startup is lengthy; it might be ok or pretty bad depending on your use case)</li>
</ul>

<h3 id="now-what-are-lambda-constraints">Now, what are Lambda constraints?</h3>

<ul>
  <li><strong>Startup time:</strong>
    <ul>
      <li>For my simple example it was around 31 seconds (and it’s still time that you pay for). Subsequent invocations are much faster, but the first one still hurts.</li>
    </ul>
  </li>
  <li><strong>Timeout:</strong>
    <ul>
      <li>15min max of runtime</li>
    </ul>
  </li>
  <li><strong>Memory:</strong>
    <ul>
      <li>10 GB</li>
    </ul>
  </li>
  <li><strong>CPU:</strong>
    <ul>
      <li>Proportional to memory; at 10GB it will give you around 6 vCPUs</li>
    </ul>
  </li>
</ul>

<p><img src="/assets/images/2024-07-26-running-r-on-aws-lambda/Untitled 3.png" alt="Lambda Constraints" /></p>

<p><em>Taken from <a href="https://www.youtube.com/watch?v=rpL77KDN92Q">https://www.youtube.com/watch?v=rpL77KDN92Q</a></em></p>

<ul>
  <li>For price/power tuning: <a href="https://github.com/alexcasalboni/aws-lambda-power-tuning">https://github.com/alexcasalboni/aws-lambda-power-tuning</a></li>
</ul>

<h3 id="references">References</h3>

<ul>
  <li><a href="https://mdneuzerling.com/post/r-on-aws-lambda-with-containers/">https://mdneuzerling.com/post/r-on-aws-lambda-with-containers/</a></li>
  <li><a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-walkthrough.html">https://docs.aws.amazon.com/lambda/latest/dg/runtimes-walkthrough.html</a></li>
</ul>]]></content><author><name></name></author><category term="R" /><category term="Docker" /><category term="AWS" /><category term="Lambda" /><summary type="html"><![CDATA[What’s AWS Lambda?]]></summary></entry><entry><title type="html">Pandas Tips And Tricks For Finance</title><link href="https://bobrinik.github.io/2024/06/16/pandas-tips-and-tricks-for-finance.html" rel="alternate" type="text/html" title="Pandas Tips And Tricks For Finance" /><published>2024-06-16T00:00:00+00:00</published><updated>2024-06-16T00:00:00+00:00</updated><id>https://bobrinik.github.io/2024/06/16/pandas-tips-and-tricks-for-finance</id><content type="html" xml:base="https://bobrinik.github.io/2024/06/16/pandas-tips-and-tricks-for-finance.html"><![CDATA[<h2 id="what-is-about">What is about?</h2>

<p>Here I’m keeping track of a collection of useful functions for analysing time series with Pandas.</p>

<h2 id="correlation">Correlation</h2>

<ul>
  <li>Taken from <code class="language-plaintext highlighter-rouge">Python for Finance, 2nd Edition</code></li>
</ul>

<p><a href="https://learning.oreilly.com/library/view/python-for-finance/9781492024323/ch08.html#idm45322766017448">Python for Finance, 2nd Edition</a></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rets</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">56</span><span class="p">]:</span>           <span class="p">.</span><span class="n">SPX</span>      <span class="p">.</span><span class="n">VIX</span>
         <span class="p">.</span><span class="n">SPX</span>  <span class="mf">1.000000</span> <span class="o">-</span><span class="mf">0.804382</span>
         <span class="p">.</span><span class="n">VIX</span> <span class="o">-</span><span class="mf">0.804382</span>  <span class="mf">1.000000</span>
         

<span class="n">In</span> <span class="p">[</span><span class="mi">57</span><span class="p">]:</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">rets</span><span class="p">[</span><span class="s">'.SPX'</span><span class="p">].</span><span class="n">rolling</span><span class="p">(</span><span class="n">window</span><span class="o">=</span><span class="mi">252</span><span class="p">).</span><span class="n">corr</span><span class="p">(</span>
                           <span class="n">rets</span><span class="p">[</span><span class="s">'.VIX'</span><span class="p">]).</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
         <span class="n">ax</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">rets</span><span class="p">.</span><span class="n">corr</span><span class="p">().</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/assets/images/2024-06-16-pandas-tips/uvxy_plot.png" alt="Rolling correlation plot" /></p>]]></content><author><name></name></author><category term="trading" /><category term="cheat-sheet" /><category term="Python" /><summary type="html"><![CDATA[What is about?]]></summary></entry><entry><title type="html">LLMs for clustering TO exchange tickers</title><link href="https://bobrinik.github.io/2024/04/16/llms-for-clustering-to-exchange-tickers.html" rel="alternate" type="text/html" title="LLMs for clustering TO exchange tickers" /><published>2024-04-16T00:00:00+00:00</published><updated>2024-04-16T00:00:00+00:00</updated><id>https://bobrinik.github.io/2024/04/16/llms-for-clustering-to-exchange-tickers</id><content type="html" xml:base="https://bobrinik.github.io/2024/04/16/llms-for-clustering-to-exchange-tickers.html"><![CDATA[<p>You can diversify portfolios across sectors. The idea is that each sector has different supply lines and revenue streams. So if something goes wrong with let’s say production of potash, it should not affect your tech sector.</p>

<p>I wanted to see if, instead of using sectors pre-defined by some other organization, I could partition tickers based on their risk profile. To do that, I could use the knowledge compressed in an OpenAI LLM.</p>

<p>So the idea is to use OpenAI embeddings of risks for clustering Toronto Exchange tickers. The hypothesis is that those clusters can be used instead of sectors. If successful, it would allow diversifying across risks instead of across volatility and expected return, or sectors.</p>
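<p>To make the idea concrete, here is a toy Python sketch of comparing tickers by their risk embeddings. Everything in it is an assumption for illustration: the tickers and the 3-dimensional vectors are made up (real embeddings come from an embedding model and are much longer), and averaging is just one possible way to merge several risk embeddings into a single profile, not necessarily the one used in the notebook.</p>

```python
import math


def mean_vector(vectors):
    """Merge several risk embeddings into one profile by averaging each dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up tickers and made-up 3-dimensional "embeddings", two risks per ticker
risk_embeddings = {
    "MINER": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "POTASH": [[0.7, 0.3, 0.0], [0.9, 0.0, 0.2]],
    "SOFTWARE": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}

profiles = {ticker: mean_vector(vs) for ticker, vs in risk_embeddings.items()}

print(cosine_similarity(profiles["MINER"], profiles["POTASH"]))
print(cosine_similarity(profiles["MINER"], profiles["SOFTWARE"]))
```

<p>Tickers whose risk profiles point in a similar direction would end up in the same cluster; a clustering algorithm then groups the profiles using a similarity or distance measure like this one.</p>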

<p>Unfortunately, it didn’t work; I think the prompt or the way I was merging embeddings for risks was not ideal. Anyway, if someone wants to continue, the code is on GitHub.</p>

<p><a href="https://github.com/Bobrinik/finnancial_explorations/blob/main/llm_for_stock_risk_analysis/3.explore_risks.ipynb">View the notebook on GitHub →</a></p>]]></content><author><name></name></author><category term="trading" /><category term="portfolio" /><category term="llm" /><summary type="html"><![CDATA[You can diversify portfolios across sectors. The idea is that each sector has different supply lines and revenue streams. So if something goes wrong with let’s say production of potash, it should not affect your tech sector.]]></summary></entry><entry><title type="html">How to download portfolio composition from Wealthsimple</title><link href="https://bobrinik.github.io/2024/04/03/how-to-download-portfolio-composition-from-wealthsimple.html" rel="alternate" type="text/html" title="How to download portfolio composition from Wealthsimple" /><published>2024-04-03T00:00:00+00:00</published><updated>2024-04-03T00:00:00+00:00</updated><id>https://bobrinik.github.io/2024/04/03/how-to-download-portfolio-composition-from-wealthsimple</id><content type="html" xml:base="https://bobrinik.github.io/2024/04/03/how-to-download-portfolio-composition-from-wealthsimple.html"><![CDATA[<p>In short, people are asking for capabilities to export data from Wealthsimple so that they can track it in Excel or do some Python modelling. So far, the solutions are to either use Wealthica that is using some unknown API or some sort of a crawler to extract that information (you would need to give it your creds, not ideal) you would also need to pay for ability to download it from them or you can manually copy paste the information.</p>

<h2 id="solution">Solution</h2>

<blockquote>
  <p>Grease Monkey is a popular browser extension that allows users to customize the functionality and appearance of websites they visit. It works with various web browsers, including Google Chrome, Mozilla Firefox, and others. Grease Monkey uses user scripts, which are small JavaScript programs, to modify the behavior of web pages. Grease Monkey works by injecting user scripts into web pages as they are loaded in your browser.  - ChatGPT</p>
</blockquote>

<p>The idea is to inject script into webpage that would add functionality which is lacking. That script would get necessary data from the loaded webpage and put it into a CSV. It would also add a download button to the webpage so that person could download it.</p>

<p>Here is how the script looks.</p>

<div class="language-jsx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ==UserScript==</span>
<span class="c1">// @name          jQuery Example</span>
<span class="c1">// @require       https://cdnjs.cloudflare.com/ajax/libs/jquery/3.7.1/jquery.min.js</span>
<span class="c1">// ==/UserScript==</span>

<span class="kd">function</span> <span class="nx">getFormattedDate</span><span class="p">()</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">dateObj</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Date</span><span class="p">();</span>
    <span class="kd">var</span> <span class="nx">year</span> <span class="o">=</span> <span class="nx">dateObj</span><span class="p">.</span><span class="nx">getFullYear</span><span class="p">();</span>
    <span class="kd">var</span> <span class="nx">month</span> <span class="o">=</span> <span class="p">(</span><span class="dl">"</span><span class="s2">0</span><span class="dl">"</span> <span class="o">+</span> <span class="p">(</span><span class="nx">dateObj</span><span class="p">.</span><span class="nx">getMonth</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)).</span><span class="nx">slice</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">);</span> <span class="c1">// getMonth() is zero-based</span>
    <span class="kd">var</span> <span class="nx">day</span> <span class="o">=</span> <span class="p">(</span><span class="dl">"</span><span class="s2">0</span><span class="dl">"</span> <span class="o">+</span> <span class="nx">dateObj</span><span class="p">.</span><span class="nx">getDate</span><span class="p">()).</span><span class="nx">slice</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">);</span>

    <span class="k">return</span> <span class="s2">`</span><span class="p">${</span><span class="nx">year</span><span class="p">}</span><span class="s2">-</span><span class="p">${</span><span class="nx">month</span><span class="p">}</span><span class="s2">-</span><span class="p">${</span><span class="nx">day</span><span class="p">}</span><span class="s2">`</span><span class="p">;</span>
<span class="p">}</span>

<span class="nb">window</span><span class="p">.</span><span class="nx">onload</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
    <span class="nx">setTimeout</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
      <span class="nx">jQuery</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">ready</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">$</span><span class="p">)</span> <span class="p">{</span>
          <span class="kd">let</span> <span class="nx">downloadButton</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">createElement</span><span class="p">(</span><span class="dl">"</span><span class="s2">button</span><span class="dl">"</span><span class="p">);</span>
          <span class="nx">downloadButton</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Download CSV</span><span class="dl">"</span><span class="p">;</span>
          <span class="nx">downloadButton</span><span class="p">.</span><span class="nx">id</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">csvButton</span><span class="dl">"</span><span class="p">;</span>
          <span class="nx">downloadButton</span><span class="p">.</span><span class="nx">style</span><span class="p">.</span><span class="nx">padding</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">20px</span><span class="dl">"</span><span class="p">;</span> 
        
          <span class="nb">document</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">insertBefore</span><span class="p">(</span><span class="nx">downloadButton</span><span class="p">,</span> <span class="nb">document</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">firstChild</span><span class="p">);</span>

          <span class="kd">function</span> <span class="nx">generateCSV</span><span class="p">()</span> <span class="p">{</span>
              <span class="kd">let</span> <span class="nx">separator</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">,</span><span class="dl">"</span><span class="p">;</span>
              <span class="kd">let</span> <span class="nx">csvContent</span> <span class="o">=</span> <span class="p">[];</span>
              <span class="kd">let</span> <span class="nx">header</span> <span class="o">=</span> <span class="p">[</span><span class="dl">'</span><span class="s1">Security</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Name</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Total_Value</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Quantity</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">All_Time_Return</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Per_All_time_Return</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Today_Price</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Per_Today_Price</span><span class="dl">'</span><span class="p">];</span>
              
              <span class="nx">csvContent</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">header</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="nx">separator</span><span class="p">));</span>
                          
              <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">tbody tr</span><span class="dl">"</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
                  <span class="kd">let</span> <span class="nx">row</span> <span class="o">=</span> <span class="p">[];</span>
                  <span class="nx">$</span><span class="p">(</span><span class="k">this</span><span class="p">).</span><span class="nx">find</span><span class="p">(</span><span class="dl">"</span><span class="s2">td</span><span class="dl">"</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
                      <span class="nx">$</span><span class="p">(</span><span class="k">this</span><span class="p">).</span><span class="nx">find</span><span class="p">(</span><span class="dl">"</span><span class="s2">p</span><span class="dl">"</span><span class="p">).</span><span class="nx">each</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
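                          <span class="c1">// Quote each field so values containing commas stay in one CSV column</span>
                          <span class="nx">row</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="dl">'</span><span class="s1">"</span><span class="dl">'</span> <span class="o">+</span> <span class="nx">$</span><span class="p">(</span><span class="k">this</span><span class="p">).</span><span class="nx">text</span><span class="p">().</span><span class="nx">replace</span><span class="p">(</span><span class="sr">/"/g</span><span class="p">,</span> <span class="dl">'</span><span class="s1">""</span><span class="dl">'</span><span class="p">)</span> <span class="o">+</span> <span class="dl">'</span><span class="s1">"</span><span class="dl">'</span><span class="p">);</span>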
                      <span class="p">});</span>
                  <span class="p">});</span>
                
                  <span class="k">if</span><span class="p">(</span><span class="nx">row</span><span class="p">.</span><span class="nx">length</span> <span class="o">==</span> <span class="mi">9</span><span class="p">)</span> <span class="p">{</span>
                    <span class="nx">row</span> <span class="o">=</span> <span class="nx">row</span><span class="p">.</span><span class="nx">slice</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
                  <span class="p">}</span>
                  <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">row</span><span class="p">);</span>
                  <span class="nx">csvContent</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">row</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="nx">separator</span><span class="p">));</span>
              <span class="p">});</span>
              <span class="k">return</span> <span class="nx">csvContent</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="dl">"</span><span class="se">\n</span><span class="dl">"</span><span class="p">);</span>
          <span class="p">}</span>

          <span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="dl">"</span><span class="s2">csvButton</span><span class="dl">"</span><span class="p">).</span><span class="nx">addEventListener</span><span class="p">(</span><span class="dl">"</span><span class="s2">click</span><span class="dl">"</span><span class="p">,</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
              <span class="kd">let</span> <span class="nx">accountName</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">.knseRw &gt; div:nth-child(1)</span><span class="dl">"</span><span class="p">).</span><span class="nx">text</span><span class="p">();</span>
              <span class="kd">let</span> <span class="nx">csvContent</span> <span class="o">=</span> <span class="nx">generateCSV</span><span class="p">();</span>
              <span class="kd">var</span> <span class="nx">hiddenElement</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">createElement</span><span class="p">(</span><span class="dl">'</span><span class="s1">a</span><span class="dl">'</span><span class="p">);</span>
              <span class="nx">hiddenElement</span><span class="p">.</span><span class="nx">href</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">data:text/csv;charset=utf-8,</span><span class="dl">'</span> <span class="o">+</span> <span class="nb">encodeURIComponent</span><span class="p">(</span><span class="nx">csvContent</span><span class="p">);</span>
              <span class="nx">hiddenElement</span><span class="p">.</span><span class="nx">target</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">_blank</span><span class="dl">'</span><span class="p">;</span>
              <span class="nx">hiddenElement</span><span class="p">.</span><span class="nx">download</span> <span class="o">=</span> <span class="nx">accountName</span><span class="o">+</span><span class="dl">'</span><span class="s1">_portfolio_</span><span class="dl">'</span><span class="o">+</span><span class="nx">getFormattedDate</span><span class="p">()</span><span class="o">+</span><span class="dl">'</span><span class="s1">.csv</span><span class="dl">'</span><span class="p">;</span>
              <span class="nx">hiddenElement</span><span class="p">.</span><span class="nx">click</span><span class="p">();</span>
          <span class="p">});</span>
      <span class="p">});</span>
    <span class="p">},</span> <span class="mi">5000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
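
<p>Once the file is downloaded, it can be loaded for the Python modelling mentioned at the start. The sketch below is a minimal illustration using only the standard library. It assumes that amounts are exported as display strings such as <code class="language-plaintext highlighter-rouge">$1,234.56</code> and that fields containing commas are quoted; the sample row and the <code class="language-plaintext highlighter-rouge">parse_amount</code> helper are made up for illustration and are not part of the userscript.</p>

```python
import csv
import io


def parse_amount(text):
    """Turn a display string such as "$1,234.56" or "-2.5%" into a float."""
    cleaned = text.replace("$", "").replace(",", "").replace("%", "").strip()
    return float(cleaned)


# Hypothetical sample in the shape of the exported file; real values will differ.
sample = io.StringIO(
    'Security,Name,Total_Value,Quantity\n'
    'ABC,Example Corp,"$1,234.56",10\n'
)
rows = list(csv.DictReader(sample))

# Sum the Total_Value column after converting the display strings to numbers.
total = sum(parse_amount(row["Total_Value"]) for row in rows)
```

<p>With a real export, replace the in-memory sample with <code class="language-plaintext highlighter-rouge">open(path, newline="")</code> and check the cleaning rules against the strings that actually appear in your file.</p>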

<p>You can read more and follow instructions <a href="https://github.com/Bobrinik/wealthsimple_utilities/tree/main">here</a>.</p>]]></content><author><name></name></author><category term="tutorial" /><category term="wealthsimple" /><category term="hacky_solution" /><summary type="html"><![CDATA[In short, people are asking for capabilities to export data from Wealthsimple so that they can track it in Excel or do some Python modelling. So far, the solutions are to either use Wealthica that is using some unknown API or some sort of a crawler to extract that information (you would need to give it your creds, not ideal) you would also need to pay for ability to download it from them or you can manually copy paste the information.]]></summary></entry><entry><title type="html">Compute OHCL from Tick Data with Google BigQuery</title><link href="https://bobrinik.github.io/2024/03/01/compute-ohcl-from-tick-data-with-google-bigquery.html" rel="alternate" type="text/html" title="Compute OHCL from Tick Data with Google BigQuery" /><published>2024-03-01T00:00:00+00:00</published><updated>2024-03-01T00:00:00+00:00</updated><id>https://bobrinik.github.io/2024/03/01/compute-ohcl-from-tick-data-with-google-bigquery</id><content type="html" xml:base="https://bobrinik.github.io/2024/03/01/compute-ohcl-from-tick-data-with-google-bigquery.html"><![CDATA[<h2 id="pre-reqs-to-follow-this-tutorial">Pre-reqs to follow this tutorial</h2>

<ul>
  <li>Know what a Google Cloud Storage bucket is and how to copy files to it</li>
  <li>Have the <code class="language-plaintext highlighter-rouge">gcloud</code> tool configured on your local machine</li>
  <li>Know how to use Python</li>
  <li>Know how to use bash</li>
</ul>

<h3 id="getting-data">Getting data</h3>

<p>Finnhub provides tick-level data for the TSX that you can bulk download, covering 2021 up to last month.
<img src="/assets/images/2024-03-01-compute-ohcl/finnhub-example.webp" alt="Finnhub bulk download" /></p>

<p>You can download each month separately or use the script below to get everything.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nv">TOKEN</span><span class="o">=</span><span class="s2">"YOUR_TOKEN"</span>
<span class="nv">DIR_NAME</span><span class="o">=</span><span class="s2">"./finnhub_data/"</span>

<span class="k">for </span>year <span class="k">in</span> <span class="o">{</span>2021..2023<span class="o">}</span><span class="p">;</span> <span class="k">do 
    for </span>month <span class="k">in</span> <span class="o">{</span>1..12<span class="o">}</span><span class="p">;</span> <span class="k">do</span> 
        <span class="c"># Get the redirect URL</span>
        <span class="nv">REDIRECT_URL</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="s2">"https://finnhub.io/api/v1/bulk-download?exchange=to&amp;dataType=trade&amp;year=</span><span class="nv">$year</span><span class="s2">&amp;month=</span><span class="nv">$month</span><span class="s2">&amp;token=</span><span class="nv">$TOKEN</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s1">'href="[^"]+"'</span> | <span class="nb">cut</span> <span class="nt">-d</span><span class="s1">'"'</span> <span class="nt">-f2</span><span class="si">)</span>
        <span class="nb">mkdir</span> <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$DIR_NAME</span><span class="s2">"</span>
        <span class="c"># Follow the redirect if a URL was found</span>
        <span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$REDIRECT_URL</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
            </span>curl <span class="nt">-o</span> <span class="s2">"to_trade_</span><span class="nv">$year</span><span class="s2">-</span><span class="nv">$month</span><span class="s2">.tar"</span> <span class="s2">"</span><span class="nv">$REDIRECT_URL</span><span class="s2">"</span>
            <span class="nb">mv</span> <span class="s2">"to_trade_</span><span class="nv">$year</span><span class="s2">-</span><span class="nv">$month</span><span class="s2">.tar"</span> <span class="s2">"</span><span class="nv">$DIR_NAME</span><span class="s2">"</span>
        <span class="k">fi

        </span><span class="nb">sleep </span>1
    <span class="k">done
done</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Copy paste the code into file, say fetch_finnhub_archive.sh</span>
<span class="nb">chmod</span> +x fetch_finnhub_archive.sh
./fetch_finnhub_archive.sh
</code></pre></div></div>

<p>Once you are done, you will end up with 94GB of files. Now let’s say you want to convert this to 1-min OHCL data. You can use pandas and do the processing, or you can use Google BigQuery to do that.</p>

<h2 id="compute-ohcl-with-google-bigquery">Compute OHCL with Google BigQuery</h2>

<ol>
  <li>Untar files</li>
  <li>You will end up with many small files that you can compress into bigger files</li>
  <li>Upload bigger files to Google Bucket</li>
  <li>Import files into BigQuery table</li>
  <li>Compute OHCL from it and store results in a separate table</li>
  <li>Export the ohcl table into Google Bucket</li>
  <li>Download result to your local</li>
  <li>Costs</li>
</ol>

<h3 id="untar-all-of-your-tick-archives">Untar all of your tick archives</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="k">for </span>file <span class="k">in</span> <span class="nv">$1</span>/<span class="k">*</span>.tar<span class="p">;</span> <span class="k">do</span>
    <span class="c"># Name the output directory after the archive, without the .tar suffix</span>
    <span class="nv">dir_name</span><span class="o">=</span><span class="s2">"./uncompressed/</span><span class="si">$(</span><span class="nb">basename</span> <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> .tar<span class="si">)</span><span class="s2">"</span>
    <span class="nb">mkdir</span> <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$dir_name</span><span class="s2">"</span>
    <span class="c"># Extract the tar file into the directory</span>
    <span class="nb">echo</span> <span class="s2">"Extracting </span><span class="nv">$file</span><span class="s2"> to </span><span class="nv">$dir_name</span><span class="s2">..."</span>
    <span class="nb">tar</span> <span class="nt">-xf</span> <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="nt">-C</span> <span class="s2">"</span><span class="nv">$dir_name</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Copy and paste into a script called uncompress_finnhub_archive.sh</span>
<span class="nb">chmod</span> +x uncompress_finnhub_archive.sh
./uncompress_finnhub_archive.sh ./finnhub_data
</code></pre></div></div>

<p>After running this script, <code class="language-plaintext highlighter-rouge">cd uncompressed/to_trade_2021-1</code> and run <code class="language-plaintext highlighter-rouge">ls -hl</code>. You will see something like this.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>total 2.5M
drwx------ 2 user user 124K Jan  5  2021 2021-01-04
drwx------ 2 user user 120K Jan  5  2021 2021-01-05
drwx------ 2 user user 124K Jan  6  2021 2021-01-06
drwx------ 2 user user 116K Jan  7  2021 2021-01-07
drwx------ 2 user user 128K Jan  8  2021 2021-01-08
drwx------ 2 user user 128K Jan 12  2021 2021-01-11
drwx------ 2 user user 124K Jan 13  2021 2021-01-12
drwx------ 2 user user 124K Jan 14  2021 2021-01-13
drwx------ 2 user user 124K Jan 15  2021 2021-01-14
drwx------ 2 user user 124K Jan 15  2021 2021-01-15
drwx------ 2 user user 120K Jan 19  2021 2021-01-18
drwx------ 2 user user 120K Jan 19  2021 2021-01-19
drwx------ 2 user user 124K Jan 20  2021 2021-01-20
drwx------ 2 user user 120K Jan 21  2021 2021-01-21
drwx------ 2 user user 124K Jan 23  2021 2021-01-22
drwx------ 2 user user 128K Jan 26  2021 2021-01-25
drwx------ 2 user user 124K Jan 27  2021 2021-01-26
drwx------ 2 user user 124K Jan 27  2021 2021-01-27
drwx------ 2 user user 124K Jan 28  2021 2021-01-28
drwx------ 2 user user 124K Jan 31  2021 2021-01-29
</code></pre></div></div>

<p>How many files are there in total and what’s their average size?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find "uncompressed" -type f | wc -l
2490838
find "uncompressed" -type f -exec du -k {} + | awk '{sum += $1} END {print sum}'
12081404

❯ python3
&gt;&gt;&gt; 12081404 / 2490838
4.85033711546074 # KB
</code></pre></div></div>
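
<p>The same check can be sketched in Python with only the standard library. The function name <code class="language-plaintext highlighter-rouge">count_and_average_kb</code> is made up for illustration. Note that it sums file sizes in bytes, while <code class="language-plaintext highlighter-rouge">du -k</code> reports disk usage in KB blocks, so the two averages may not match exactly.</p>

```python
import os


def count_and_average_kb(root):
    """Return (number of files, average file size in KB) under root."""
    sizes = [
        os.path.getsize(os.path.join(folder, name))
        for folder, _, names in os.walk(root)  # walks every nested directory
        for name in names
    ]
    if not sizes:
        return 0, 0.0
    return len(sizes), sum(sizes) / len(sizes) / 1024
```

<p>Calling <code class="language-plaintext highlighter-rouge">count_and_average_kb("uncompressed")</code> from the directory that holds the extracted archives returns the two numbers as a tuple.</p>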

<ul>
  <li>We have lots of small files, and uploading each one separately to a Google Cloud Storage bucket for further processing would take a long time.</li>
  <li>Instead, let’s collate them into larger <code class="language-plaintext highlighter-rouge">.csv</code> files.</li>
</ul>

<p>To do this, let’s use the script below. Note, you need to install <code class="language-plaintext highlighter-rouge">pandas</code> and <code class="language-plaintext highlighter-rouge">tqdm</code> libraries.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="k">for</span> <span class="nb">dir</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="s">"./uncompressed"</span><span class="p">),</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Processing months"</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="sa">f</span><span class="s">"./uncompressed/</span><span class="si">{</span><span class="nb">dir</span><span class="si">}</span><span class="s">"</span><span class="p">),</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Processing days"</span><span class="p">):</span>
            <span class="n">tables</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="n">file_name</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"./transformed/transformed_</span><span class="si">{</span><span class="nb">dir</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="nb">file</span><span class="si">}</span><span class="s">.csv"</span>
            <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">file_name</span><span class="p">):</span>
                <span class="k">continue</span>  <span class="c1"># this day was already collated on a previous run
</span>
            <span class="k">for</span> <span class="n">asset</span> <span class="ow">in</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="sa">f</span><span class="s">"./uncompressed/</span><span class="si">{</span><span class="nb">dir</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="nb">file</span><span class="si">}</span><span class="s">"</span><span class="p">):</span>
                <span class="n">symbol</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">".csv.gz"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
                <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="sa">f</span><span class="s">"./uncompressed/</span><span class="si">{</span><span class="nb">dir</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="nb">file</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">asset</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">compression</span><span class="o">=</span><span class="s">'gzip'</span><span class="p">)</span>
                <span class="n">df</span><span class="p">[</span><span class="s">"symbol"</span><span class="p">]</span> <span class="o">=</span> <span class="n">symbol</span>
                <span class="n">tables</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

            <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">tables</span><span class="p">)</span>
            <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">"./transformed"</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">file_name</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Skipping"</span><span class="p">)</span>
</code></pre></div></div>

<p>So how many files do we have now?</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find <span class="s2">"transformed"</span> <span class="nt">-type</span> f | <span class="nb">wc</span> <span class="nt">-l</span>
 749
</code></pre></div></div>

<p>As you can see, we have far fewer files, and each one is much bigger. It is now more manageable to load everything into a Google Cloud Storage bucket and process it with BigQuery.</p>

<p>At this point, upload the files from your local machine to a bucket with the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gsutil <span class="nt">-m</span> <span class="nb">cp</span> <span class="nt">-r</span> transformed gs://your-bucket-datalake/finnhub_transformed
</code></pre></div></div>

<p>Depending on your upload speed, this might take some time. You can also do all of the above steps on a Google Compute Engine VM, where the upload speed to the bucket will not be an issue.</p>

<h2 id="import-files-into-bigquery">Import files into BigQuery</h2>

<ol>
  <li>Create a dataset in BigQuery</li>
  <li>Create a table and specify the path to the location in the Google Cloud Storage bucket that contains all of the transformed files: <code class="language-plaintext highlighter-rouge">your-bucket-datalake/finnhub_transformed/*</code></li>
  <li>Don’t forget to enable <code class="language-plaintext highlighter-rouge">Schema Auto Detect</code></li>
</ol>

<p><img src="/assets/images/2024-03-01-compute-ohcl/bigquery-create-table.png" alt="BigQuery Create Table" /></p>

<h2 id="compute-ohcl-from-it-and-store-results-in-a-separate-table">Compute OHCL from it and store results in a separate table</h2>

<p>Now that our data is within the BigQuery table, we can use BigQuery SQL to compute OHCL.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">trade_data</span><span class="p">.</span><span class="n">one_minute_ohcl</span> <span class="k">AS</span>

<span class="k">WITH</span> <span class="n">MinuteRounded</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="c1">-- This subquery truncates each timestamp to the start of its minute</span>
  <span class="k">SELECT</span>
    <span class="n">TIMESTAMP_TRUNC</span><span class="p">(</span><span class="n">TIMESTAMP_MILLIS</span><span class="p">(</span><span class="nb">timestamp</span><span class="p">),</span> <span class="k">MINUTE</span><span class="p">)</span> <span class="k">AS</span> <span class="n">minute_timestamp</span><span class="p">,</span>
    <span class="n">symbol</span><span class="p">,</span>
    <span class="n">price</span><span class="p">,</span>
    <span class="n">volume</span><span class="p">,</span>
    <span class="nb">timestamp</span>  <span class="c1">-- Include the raw timestamp</span>
  <span class="k">FROM</span>
    <span class="n">trade_data</span><span class="p">.</span><span class="n">tick_data</span>
<span class="p">),</span>

<span class="n">AggregatedData</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span>
    <span class="n">minute_timestamp</span><span class="p">,</span>
    <span class="n">symbol</span><span class="p">,</span>
    <span class="n">FIRST_VALUE</span><span class="p">(</span><span class="n">price</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">w</span> <span class="k">AS</span> <span class="k">open</span><span class="p">,</span>
    <span class="k">MAX</span><span class="p">(</span><span class="n">price</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">w</span> <span class="k">AS</span> <span class="n">high</span><span class="p">,</span>
    <span class="k">MIN</span><span class="p">(</span><span class="n">price</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">w</span> <span class="k">AS</span> <span class="n">low</span><span class="p">,</span>
    <span class="n">LAST_VALUE</span><span class="p">(</span><span class="n">price</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">w</span> <span class="k">AS</span> <span class="k">close</span><span class="p">,</span>
    <span class="k">SUM</span><span class="p">(</span><span class="n">volume</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">w</span> <span class="k">AS</span> <span class="n">volume</span>
  <span class="k">FROM</span>
    <span class="n">MinuteRounded</span>
  <span class="k">WINDOW</span> <span class="n">w</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">PARTITION</span> <span class="k">BY</span> <span class="n">symbol</span><span class="p">,</span> <span class="n">minute_timestamp</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">timestamp</span>
    <span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="k">PRECEDING</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="k">FOLLOWING</span>
  <span class="p">)</span>
<span class="p">)</span>

<span class="k">SELECT</span>
  <span class="n">minute_timestamp</span><span class="p">,</span>
  <span class="n">symbol</span><span class="p">,</span>
  <span class="k">open</span><span class="p">,</span>
  <span class="n">high</span><span class="p">,</span>
  <span class="n">low</span><span class="p">,</span>
  <span class="k">close</span><span class="p">,</span>
  <span class="n">volume</span>
<span class="k">FROM</span>
  <span class="n">AggregatedData</span>
<span class="k">GROUP</span> <span class="k">BY</span> 
  <span class="n">minute_timestamp</span><span class="p">,</span> <span class="n">symbol</span><span class="p">,</span> <span class="k">open</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="n">low</span><span class="p">,</span> <span class="k">close</span><span class="p">,</span> <span class="n">volume</span>
<span class="k">ORDER</span> <span class="k">BY</span> 
  <span class="n">symbol</span><span class="p">,</span> <span class="n">minute_timestamp</span><span class="p">;</span>
</code></pre></div></div>

<p>Once the query above finishes, you will have a new table called <code class="language-plaintext highlighter-rouge">one_minute_ohcl</code> that you can export to a bucket from the UI. You might get an error saying that the destination bucket must be in the same region as the data you are reading. The error also tells you which region the bucket needs to be in. To resolve it, create a new bucket in that region.</p>
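
<p>If you would rather script the export than click through the UI, BigQuery also has an <code class="language-plaintext highlighter-rouge">EXPORT DATA</code> statement. The snippet below is a minimal sketch, not part of the original workflow: the bucket name <code class="language-plaintext highlighter-rouge">your-bucket</code> is a placeholder, and it assumes the table lives in the same <code class="language-plaintext highlighter-rouge">trade_data</code> dataset as the tick data. The same region constraint most likely applies here, so the bucket should be in the region of the dataset.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Sketch only: 'your-bucket' is a placeholder, and the trade_data dataset is an assumption.
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/one_minute_ohcl/*.csv',  -- the URI must contain a single * wildcard
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT
  minute_timestamp,
  symbol,
  open,
  high,
  low,
  close,
  volume
FROM
  trade_data.one_minute_ohcl
ORDER BY
  symbol, minute_timestamp;
</code></pre></div></div>

<p>BigQuery may split the result into several files, which is why the URI needs the wildcard.</p>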

<h2 id="costs">Costs</h2>

<ul>
  <li>Finnhub subscription: <strong><code class="language-plaintext highlighter-rouge">$149.97 USD</code></strong> per quarter (the minimum you can buy)</li>
  <li>[Optional] ~3 hours of compute for downloading and processing the data: at most ~<code class="language-plaintext highlighter-rouge">5 USD</code></li>
  <li>BigQuery is free here, since this data volume falls within the free tier</li>
</ul>]]></content><author><name></name></author><category term="trading" /><category term="data processing" /><category term="ohcl" /><category term="tutorial" /><category term="gcloud" /><summary type="html"><![CDATA[Pre-reqs to follow this tutorial]]></summary></entry><entry><title type="html">Predicting the winner of Kentucky Derby</title><link href="https://bobrinik.github.io/2022/05/18/predicting-kentucky-derby-winner.html" rel="alternate" type="text/html" title="Predicting the winner of Kentucky Derby" /><published>2022-05-18T00:00:00+00:00</published><updated>2022-05-18T00:00:00+00:00</updated><id>https://bobrinik.github.io/2022/05/18/predicting-kentucky-derby-winner</id><content type="html" xml:base="https://bobrinik.github.io/2022/05/18/predicting-kentucky-derby-winner.html"><![CDATA[<p>There is a horse race called Kentucky Derby. People are betting on the outcomes of this race. Let’s do an analysis to see if we can get an edge over other people.</p>

<p>The Kentucky Derby is one of the most prestigious horse racing events in the world, attracting millions of viewers and bettors alike. With so much money on the line, can data analysis give us an advantage over the average bettor?</p>

<p>In this analysis, we’ll explore historical data, track conditions, horse statistics, and other factors that might influence race outcomes.</p>

<p><a href="https://community.wolfram.com/groups/-/m/t/2526950">Read the full notebook on Wolfram Community →</a></p>]]></content><author><name></name></author><category term="Wolfram Language" /><category term="Deep Dive" /><category term="Sport Betting" /><summary type="html"><![CDATA[There is a horse race called Kentucky Derby. People are betting on the outcomes of this race. Let’s do an analysis to see if we can get an edge over other people.]]></summary></entry></feed>