<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<link href="https://python.code-maven.com/atom" rel="self" />
<title>Python programming language</title>
<id>https://python.code-maven.com</id>
<updated>2026-02-04T07:40:01Z</updated>

  <entry>
    <title>Adding GitHub Actions to qrcode-pretty</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2026-02-04T07:40:01Z</updated>
    <published>2026-02-04T07:40:01Z</published>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/adding-github-action-to-qrcode-pretty-video" />
    <id>https://python.code-maven.com/adding-github-action-to-qrcode-pretty-video</id>
    <content type="html"><![CDATA[<p><a href="https://osdc.code-maven.com/python">OSDC Python</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/W558rcFx_HY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Exploring qrcode-pretty and adding tests to it</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2026-02-04T07:30:01Z</updated>
    <published>2026-02-04T07:30:01Z</published>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/adding-tests-to-qrcode-pretty-video" />
    <id>https://python.code-maven.com/adding-tests-to-qrcode-pretty-video</id>
    <content type="html"><![CDATA[<p><a href="https://osdc.code-maven.com/python">OSDC Python</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/QQuKZqfs8WI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Fix speed</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-10-24T13:30:01Z</updated>
    <published>2025-10-24T13:30:01Z</published>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/fixed-speed" />
    <id>https://python.code-maven.com/fixed-speed</id>
<content type="html"><![CDATA[<p>Based on a recent discussion about teaching programming and whether the speed of execution matters, I wrote this snippet.</p>
<p><strong><a href="https://github.com/szabgab/python.code-maven.com/tree/main/examples/fix-speed/calculate.py">examples/fix-speed/calculate.py</a></strong></p>
<pre><code class="language-python">import sys

def run(X, Y):
    total = 0
    for x in range(X):
        for y in range(Y):
            width, height = read_config()
            total += x*width + y*height
    print(total)

def read_config():
    import csv
    with open('config.csv', newline='') as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            return int(row['width']), int(row['height'])


if __name__ == &quot;__main__&quot;:
    if len(sys.argv) != 3:
        exit(f&quot;Usage: {sys.argv[0]} X Y&quot;)
    X = int(sys.argv[1])
    Y = int(sys.argv[2])
    run(X, Y)

</code></pre>
<p>Can you suggest how to improve the speed of this code?</p>
<p>In order to try it, create a file called <code>config.csv</code> with this content:</p>
<p><strong><a href="https://github.com/szabgab/python.code-maven.com/tree/main/examples/fix-speed/config.csv">examples/fix-speed/config.csv</a></strong></p>
<pre><code class="language-csv">width,height
23,19

</code></pre>
<p>and then run</p>
<pre><code>time python3 calculate.py 1000 500
</code></pre>
<p>On my computer this takes 5 seconds to run.</p>
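<p>One possible direction (a sketch, not the only answer): the inner loop calls <code>read_config</code> X*Y times, re-opening and re-parsing the CSV file on every iteration. Hoisting that call out of the loops removes almost all of the cost:</p>

```python
import csv

def read_config():
    # Read width and height from the first data row of config.csv.
    with open('config.csv', newline='') as fh:
        for row in csv.DictReader(fh):
            return int(row['width']), int(row['height'])

def run(X, Y):
    # Read the configuration once instead of X*Y times.
    width, height = read_config()
    total = 0
    for x in range(X):
        for y in range(Y):
            total += x * width + y * height
    print(total)
```

<p>On the same input this prints the same total, but reads the file only once.</p>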
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>How to build a microservice with Python + FastAPI to switch from RDS to DynamoDB with Nikita Baryshev</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-05-20T08:30:01Z</updated>
    <pubDate>2025-05-20T08:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/how-to-build-a-microservice-with-python-and-fastapi" />
    <id>https://python.code-maven.com/how-to-build-a-microservice-with-python-and-fastapi</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/SNJJZOKBzoc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p><img src="images/nikita-baryshev.jpeg" alt="Nikita Baryshev" /></p>
<p><a href="https://www.linkedin.com/in/nikita-baryshev/">Nikita Baryshev</a></p>
<p>Nikita writes:</p>
<p>The microservice processed all requests between different clients and DDB. In addition to this, during the transfer period, both RDS and DDB were supported before the full switch to DDB. I can talk about the general approaches I used to build this microservice, how I worked with the legacy code, monitoring, and what was the outcome. Also, I will give a summary of all the pros and cons I faced and things that you could do better from the beginning.</p>
<p>I'm a full-stack developer currently working at Check Point in Tel Aviv. My stack is Angular + Python (Flask, FastAPI). I'm also interested in web accessibility.</p>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.040 --&gt; 00:00:31.269
Gabor Szabo: Hello, and welcome to the Code Maven channel and the Code Maven meetup group. My name is Gabor Szabo. I teach Python, I help companies with Python and Rust, and I also organize these meetings, these events online, because I think it's a very useful platform to share knowledge. And I'm really happy that Nikita agreed to give this presentation. Hello, Nikita.</p>
<p>2
00:00:31.270 --&gt; 00:00:32.040
Nikita Barysheva: Hi!</p>
<p>3
00:00:32.250 --&gt; 00:00:33.379
Nikita Barysheva: Nice to meet you all.</p>
<p>4
00:00:33.380 --&gt; 00:00:35.440
Gabor Szabo: And sorry, and</p>
<p>5
00:00:36.170 --&gt; 00:01:03.490
Gabor Szabo: Those people who are present, thank you for joining the meeting; you can freely ask questions in the chat. And if you're watching the video on YouTube, then please like the video and follow the channel and join our meetup groups, so you will be notified when we have the new meetings, new events. So with that said, Nikita, it's your turn. Please introduce yourself and give your presentation.</p>
<p>6
00:01:04.170 --&gt; 00:01:13.930
Nikita Barysheva: Yeah, sure, I'll start sharing my screen, and then probably I'll start exploring one second K,</p>
<p>7
00:01:14.770 --&gt; 00:01:20.399
Nikita Barysheva: Hi, everyone, once again. My name is Nikita, and I'm a software developer at the Check Point company right now.</p>
<p>8
00:01:20.590 --&gt; 00:01:32.040
Nikita Barysheva: And today I want to talk about one of the things I had in my previous experience when we decided to switch from Rds to Dynamodb</p>
<p>9
00:01:32.160 --&gt; 00:01:45.890
Nikita Barysheva: for our users table. And how we thought about it. What was the overall architecture, and how we build a micro services that helped us to switch to make this switch.</p>
<p>10
00:01:46.826 --&gt; 00:02:05.609
Nikita Barysheva: We'll cover different topics like we will talk about general differences, about our disb. And like, I will highlight some main things that might make you think why, to switch from one database to another, or will help to understand our motivation behind it.</p>
<p>11
00:02:05.740 --&gt; 00:02:17.010
Nikita Barysheva: And we will go over the architecture of the micro service that we build, and I'll give you some examples over there, and we can talk about it in more details if you want.</p>
<p>12
00:02:17.700 --&gt; 00:02:23.740
Nikita Barysheva: So let's have a quick overview. I'm I don't know</p>
<p>13
00:02:23.990 --&gt; 00:02:42.469
Nikita Barysheva: everyone familiar with the Dynamodb or Rds. What is the differences. But the main ones, like dynamodb, is a key value like no scale database. It's fully managed by aws, and it's very good for applications with like low latency, with like flexible data models</p>
<p>14
00:02:42.850 --&gt; 00:02:48.559
Nikita Barysheva: and opposite orders, is the SQL database, and we have, like a</p>
<p>15
00:02:48.830 --&gt; 00:02:59.830
Nikita Barysheva: predefined schemas, and it's also managed by aws, but the difference between diamond monds that for this you really have to invest more</p>
<p>16
00:03:00.110 --&gt; 00:03:13.869
Nikita Barysheva: into like knowledge into setting setting up things over there, and that for sure, if you have like complex queries and joins and etc, this is better for your solution.</p>
<p>17
00:03:15.296 --&gt; 00:03:26.219
Nikita Barysheva: When you decide on like which database to use, you will probably look at several things like scalability, performance availability.</p>
<p>18
00:03:26.410 --&gt; 00:03:28.730
Nikita Barysheva: And here I present some</p>
<p>19
00:03:29.070 --&gt; 00:03:36.149
Nikita Barysheva: basic stuff about differences like for each of them, and dynamic being out there but overall, saying again.</p>
<p>20
00:03:36.160 --&gt; 00:04:03.969
Nikita Barysheva: the for scalability. We know that dynamodity automatically scales horizontally, and that really helps to manage like a large amount of traffics without any interventions at the same time, like our desk scales vertically, and it has to increase the instance size, and this increase. It also takes time, and there might be some gaps in performance also because of that</p>
<p>21
00:04:03.990 --&gt; 00:04:05.570
Nikita Barysheva: and the</p>
<p>22
00:04:05.760 --&gt; 00:04:30.199
Nikita Barysheva: for performance. It really depends on the type of instance you chose the storage. Like, as I said before, you really need to know what you are doing there, and how you're setting it up, because if you won't do it like properly, you might, you might have some slowness or database won't be available. Something will be down, and users won't be happy.</p>
<p>23
00:04:30.250 --&gt; 00:04:37.455
Nikita Barysheva: And as for availability, I found out we found out basically for ourself.</p>
<p>24
00:04:38.560 --&gt; 00:04:46.200
Nikita Barysheva: like, say, big difference for dynamodb. And there is a thing like you that you can activate that calls global tables</p>
<p>25
00:04:46.390 --&gt; 00:04:50.660
Nikita Barysheva: like, it's like a multi-region multi master, right? Database solution</p>
<p>26
00:04:50.790 --&gt; 00:05:02.400
Nikita Barysheva: for this. It supports. And let's call it multi az multi availability zones. It's replicates the data across different availability zones. But in the same region.</p>
<p>27
00:05:03.740 --&gt; 00:05:09.610
Nikita Barysheva: And the another thing that you will have to consider it will be interesting for you is like</p>
<p>28
00:05:10.090 --&gt; 00:05:11.680
Nikita Barysheva: cost consideration.</p>
<p>29
00:05:13.510 --&gt; 00:05:28.390
Nikita Barysheva: or is, let's say, dynamic price pricing and Rds cost will increase as you scale vertical, like large instances horizontally like read replicas. Also Rds also provides like on demand.</p>
<p>30
00:05:28.490 --&gt; 00:05:38.600
Nikita Barysheva: but still, if you like, chose the the instance with a special like to say storage of I don't remember exactly the batches there, but</p>
<p>31
00:05:38.830 --&gt; 00:05:43.869
Nikita Barysheva: let's say, 60 gigas. You will have to pay for 60 jigas. Even you use 20 of them.</p>
<p>32
00:05:44.190 --&gt; 00:06:03.179
Nikita Barysheva: So efficient hybrid handling. Dynamodb is really optimized for hybrid scenarios, and it doesn't provide different replicas. Okay, it can handle millions of requests per second with the architecture that Kws provides to us. And the</p>
<p>33
00:06:03.480 --&gt; 00:06:12.200
Nikita Barysheva: Pre. We want. We all want to have, let's say, predictable cost and capacity modes, and because of that.</p>
<p>34
00:06:12.560 --&gt; 00:06:18.709
Nikita Barysheva: into benefit of Dynamodb, Dynamodb offers 2 modes, provisioned capacity. When you</p>
<p>35
00:06:18.910 --&gt; 00:06:24.920
Nikita Barysheva: have when you set up the database. Basically, you have to say how many read and writes like.</p>
<p>36
00:06:26.620 --&gt; 00:06:34.859
Nikita Barysheva: what is the bar? Let's say for them for for your database, and or you can use on demand</p>
<p>37
00:06:35.010 --&gt; 00:06:53.600
Nikita Barysheva: that will automatically scale up your traffic and ensure you pay only for the usage. We had a situation when we we worked on one online store, and we had a situation that we didn't predict, because no one is about like no one's following super bowl in United States.</p>
<p>38
00:06:53.800 --&gt; 00:07:00.989
Nikita Barysheva: But we did. We just lost it. And the traffic went up</p>
<p>39
00:07:01.790 --&gt; 00:07:08.899
Nikita Barysheva: and the people tried to buy beer in United States order it online, and we didn't expect that. But thanks to</p>
<p>40
00:07:09.060 --&gt; 00:07:16.610
Nikita Barysheva: the dynamo debit architecture like scaled up automatically and we are, we were on the pretty good side.</p>
<p>41
00:07:18.662 --&gt; 00:07:22.460
Nikita Barysheva: These are very general customer reviews. Okay.</p>
<p>42
00:07:22.930 --&gt; 00:07:30.140
Nikita Barysheva: hey? I just wanted to give you some examples. Don't take it like strict that you have to calculate it like this. I just wanted to</p>
<p>43
00:07:30.470 --&gt; 00:07:32.630
Nikita Barysheva: have you, Nick.</p>
<p>44
00:07:33.130 --&gt; 00:07:35.150
Nikita Barysheva: Basic understanding. Okay.</p>
<p>45
00:07:36.065 --&gt; 00:07:41.960
Nikita Barysheva: For this, you pay, for instance, cost and for the storage.</p>
<p>46
00:07:42.830 --&gt; 00:07:50.800
Nikita Barysheva: Once again the the specification could be more complicated. But we're talking about basics.</p>
<p>47
00:07:50.930 --&gt; 00:07:57.739
Nikita Barysheva: And for dynamo dB, you pay for right capacity units, read capacity units, and also for data storage.</p>
<p>48
00:07:58.080 --&gt; 00:08:05.850
Nikita Barysheva: But where regarding the storage, and I wanted to give you some like more detailed calculations here.</p>
<p>49
00:08:07.160 --&gt; 00:08:13.440
Nikita Barysheva: If you, for example, want to store 5 gigs, 10 gigas, 20 gigs.</p>
<p>50
00:08:13.610 --&gt; 00:08:21.249
Nikita Barysheva: You will pay the same price for the storage all this time, because, as I said before, you</p>
<p>51
00:08:21.470 --&gt; 00:08:24.479
Nikita Barysheva: choose the storage type, and you have to pay for it.</p>
<p>52
00:08:24.650 --&gt; 00:08:28.259
Nikita Barysheva: even if you pay, even if you use less.</p>
<p>53
00:08:28.470 --&gt; 00:08:36.140
Nikita Barysheva: Okay? At the same time you see that for download, my dB, this thing is dynamic.</p>
<p>54
00:08:37.179 --&gt; 00:08:41.329
Nikita Barysheva: and it depends on the real real story that you use.</p>
<p>55
00:08:41.620 --&gt; 00:08:48.309
Nikita Barysheva: There are more things that I mentioned here. I'm not sure if you want to be overwhelmed right now let me know. But</p>
<p>56
00:08:48.660 --&gt; 00:08:56.309
Nikita Barysheva: these are the very basic things that I wanted you to consider just to understand are the dynamodity.</p>
<p>57
00:08:56.950 --&gt; 00:08:57.940
Nikita Barysheva: And</p>
<p>58
00:08:58.050 --&gt; 00:09:11.530
Nikita Barysheva: yeah, so we talked about different like database types like SQL, Nonsql, specifically, Rds actually mentioned didn't mention it. But it considered Postgresql.</p>
<p>59
00:09:11.710 --&gt; 00:09:15.120
Nikita Barysheva: if it's important and diamond.</p>
<p>60
00:09:15.630 --&gt; 00:09:22.709
Nikita Barysheva: Now, I want to talk about the the actual problem that we had and the solution</p>
<p>61
00:09:22.890 --&gt; 00:09:24.870
Nikita Barysheva: that we found out for ourselves.</p>
<p>62
00:09:28.790 --&gt; 00:09:33.650
Nikita Barysheva: So the overall problem was that that when the</p>
<p>63
00:09:34.180 --&gt; 00:09:38.829
Nikita Barysheva: the number of users, like number of requests to the database</p>
<p>64
00:09:38.950 --&gt; 00:09:42.869
Nikita Barysheva: scaled like went up, we had spikes.</p>
<p>65
00:09:43.020 --&gt; 00:09:59.789
Nikita Barysheva: Our our desk like didn't work well sometimes. So we decided that we need to do something more stable. And we started to consider different databases. And because we had previous experience with dynamic or another project we had.</p>
<p>66
00:10:00.150 --&gt; 00:10:09.900
Nikita Barysheva: we decided that we want to build an architecture where all our clients will go to the dynamo dB,</p>
<p>67
00:10:10.060 --&gt; 00:10:15.369
Nikita Barysheva: through a user's micro service. But</p>
<p>68
00:10:15.520 --&gt; 00:10:22.409
Nikita Barysheva: the problem is another problem is that today all of our users are stored in Postgrescale.</p>
<p>69
00:10:22.960 --&gt; 00:10:26.039
Nikita Barysheva: So how to how to manage it.</p>
<p>70
00:10:26.230 --&gt; 00:10:35.769
Nikita Barysheva: 2 different databases, and like, not just physically, database. I mean different types, databases. That's kind of challenge. Okay?</p>
<p>71
00:10:35.930 --&gt; 00:10:41.610
Nikita Barysheva: So these are really again.</p>
<p>72
00:10:42.104 --&gt; 00:10:49.590
Nikita Barysheva: general overview of the solution. But on the left side. You see the clients. Each of them is like A,</p>
<p>73
00:10:49.840 --&gt; 00:10:51.350
Nikita Barysheva: the client that</p>
<p>74
00:10:51.510 --&gt; 00:10:57.339
Nikita Barysheva: it could be a back end client that wants to get data about the special and specific user.</p>
<p>75
00:10:57.500 --&gt; 00:11:01.464
Nikita Barysheva: or to get all the users by some condition and</p>
<p>76
00:11:02.330 --&gt; 00:11:16.599
Nikita Barysheva: how we do it. We decided to implement several feature flags, including like that, will tell us where we should read the data from or where we want now to write the data to.</p>
<p>77
00:11:16.870 --&gt; 00:11:24.160
Nikita Barysheva: And basing on this feature flex. We were doing like, get requests, or we're doing like.</p>
<p>78
00:11:24.690 --&gt; 00:11:33.529
Nikita Barysheva: put post to delete. We do all the separations based on this feature flags. And this is the like</p>
<p>79
00:11:33.770 --&gt; 00:11:40.780
Nikita Barysheva: Postgresql architecture, nothing like special here. And this is the user service. So we have the</p>
<p>80
00:11:41.200 --&gt; 00:11:47.700
Nikita Barysheva: containers here, and we use readies for caching, for caching and Dynama dB.</p>
<p>81
00:11:48.030 --&gt; 00:11:54.420
Nikita Barysheva: Without additional details here. But I mean I think it could be pretty clear what we are</p>
<p>82
00:11:54.760 --&gt; 00:11:59.329
Nikita Barysheva: trying to do here. Let me know if you have any questions so far.</p>
<p>83
00:11:59.700 --&gt; 00:12:06.740
Nikita Barysheva: I will. I will be happy to answer them, because this scheme, if if you have question, I will be happy to answer them just</p>
<p>84
00:12:06.860 --&gt; 00:12:10.370
Nikita Barysheva: for you to be and to make it more clear later.</p>
<p>85
00:12:12.630 --&gt; 00:12:13.830
Nikita Barysheva: And then.</p>
<p>86
00:12:15.170 --&gt; 00:12:23.570
Nikita Barysheva: except the fact that we want to transfer to Dynamodb, we need to have this transition period. So.</p>
<p>87
00:12:23.800 --&gt; 00:12:27.390
Nikita Barysheva: as you saw on the previous like scheme.</p>
<p>88
00:12:27.660 --&gt; 00:12:36.229
Nikita Barysheva: we decided to plan to implement the service like Api, that will handle all crowd operations related to our dynamic.</p>
<p>89
00:12:36.530 --&gt; 00:12:42.040
Nikita Barysheva: And we also need to transfer all users data from Rds to Dynamodb.</p>
<p>90
00:12:42.160 --&gt; 00:12:49.420
Nikita Barysheva: This was done. We wrote different scripts. We basically can grab the data from their desk.</p>
<p>91
00:12:49.730 --&gt; 00:12:58.090
Nikita Barysheva: transform the data as we want. And to basically transfer this data to Dynamodb.</p>
<p>92
00:12:58.640 --&gt; 00:13:02.679
Nikita Barysheva: And we also decided, as I mentioned</p>
<p>93
00:13:02.860 --&gt; 00:13:11.260
Nikita Barysheva: in a previous slide. We decided that we want to have feature flags. The feature of the 1st feature flag read user from Dynamodb.</p>
<p>94
00:13:11.450 --&gt; 00:13:12.520
Nikita Barysheva: If it through.</p>
<p>95
00:13:12.780 --&gt; 00:13:19.020
Nikita Barysheva: we go to Dynamodb the micro service and Dynamodb. If it's false, we go directly to the Postgrescale</p>
<p>96
00:13:19.160 --&gt; 00:13:24.129
Nikita Barysheva: and read user to our Ds and write write user to our Ds and to dynamodb.</p>
<p>97
00:13:24.330 --&gt; 00:13:30.630
Nikita Barysheva: these are 2 flags that basically we need to support this period when we</p>
<p>98
00:13:31.270 --&gt; 00:13:37.399
Nikita Barysheva: we work with both databases. So we're trying, we try to make the spirit as short as possible.</p>
<p>99
00:13:37.680 --&gt; 00:13:41.950
Nikita Barysheva: to make some like tests on the Qa. On staging and then on production.</p>
<p>100
00:13:42.504 --&gt; 00:13:47.880
Nikita Barysheva: We still have to work some production. But once we saw that everything works</p>
<p>101
00:13:48.230 --&gt; 00:13:51.070
Nikita Barysheva: like fine. When we don't have any</p>
<p>102
00:13:51.200 --&gt; 00:14:11.700
Nikita Barysheva: request from the customers, we don't have any bugs opening. So we closed right user to our desk. So the channel, let's go back for a second if I can. Yeah, basically, this channel. This path was closed. So we just continued working directly with our user service.</p>
<p>103
00:14:12.150 --&gt; 00:14:23.140
Nikita Barysheva: The read from dynamo debu was always true, and the right to dynamo debut also. True, right to our death was false. So all this scheme started</p>
<p>104
00:14:23.290 --&gt; 00:14:25.420
Nikita Barysheva: working only with this part.</p>
<p>105
00:14:25.690 --&gt; 00:14:27.470
Nikita Barysheva: Avoid imposed risk.</p>
<p>106
00:14:28.655 --&gt; 00:14:29.020
Nikita Barysheva: Okay,</p>
<p>107
00:14:30.780 --&gt; 00:14:42.270
Nikita Barysheva: I wanted to show the client side architecture. The client, as I mentioned before, like every service that wants to get information about the the user</p>
<p>108
00:14:42.800 --&gt; 00:14:48.340
Nikita Barysheva: and just wanted to give some code examples and to explain what we</p>
<p>109
00:14:48.470 --&gt; 00:14:50.530
Nikita Barysheva: just generally try to achieve that.</p>
<p>110
00:14:51.320 --&gt; 00:15:01.800
Nikita Barysheva: We wrote like a model interface, and the purpose of such interface is to be a handle for all calls</p>
<p>111
00:15:01.970 --&gt; 00:15:05.099
Nikita Barysheva: to Dynamodb through the service.</p>
<p>112
00:15:05.370 --&gt; 00:15:13.029
Nikita Barysheva: Okay, handle responses from service. Manage all the Retries, manage all the caching, etcetera. So be</p>
<p>113
00:15:13.250 --&gt; 00:15:19.509
Nikita Barysheva: the one that gets the data for for the client from the service.</p>
<p>114
00:15:20.940 --&gt; 00:15:23.820
Nikita Barysheva: It could like look like that.</p>
<p>115
00:15:23.950 --&gt; 00:15:31.610
Nikita Barysheva: And one of the functions that we could use like get user by email.</p>
<p>116
00:15:32.030 --&gt; 00:15:37.850
Nikita Barysheva: we initiate the user client with some parameters over here.</p>
<p>117
00:15:38.050 --&gt; 00:15:50.039
Nikita Barysheva: One of the parameters that I really like to really like to mention is a requester Id. I will explain later. I can explain actually, right now, because</p>
<p>118
00:15:50.260 --&gt; 00:15:51.730
Nikita Barysheva: why we need it.</p>
<p>119
00:15:52.010 --&gt; 00:16:00.319
Nikita Barysheva: basically for the login and for the tracking, something fells down. I really like that, we know which client made this request.</p>
<p>120
00:16:00.690 --&gt; 00:16:16.299
Nikita Barysheva: and on the left side the function itself which uses the the feature flag. The code could be optimized. Don't look at it like as a perfect one, just wanted to make it as much clear and readable on one slide as possible.</p>
<p>121
00:16:16.830 --&gt; 00:16:22.635
Nikita Barysheva: So if we want to get a user by email, we looking at this feature flag. And</p>
<p>122
00:16:23.180 --&gt; 00:16:27.599
Nikita Barysheva: we basically want to get to have a request to dynamodb</p>
<p>123
00:16:27.790 --&gt; 00:16:40.490
Nikita Barysheva: service. This is a client link. And we want to to make the Cpi call, else we're going as we did it before we just go into Postgresql and getting that data over there.</p>
<p>124
00:16:43.070 --&gt; 00:16:48.000
Nikita Barysheva: And this is the function that user from dynamodvis</p>
<p>125
00:16:48.250 --&gt; 00:16:51.309
Nikita Barysheva: made it detect more specific over here.</p>
<p>126
00:16:51.770 --&gt; 00:16:54.490
Nikita Barysheva: We're setting up all the parameters that we want to get.</p>
<p>127
00:16:54.650 --&gt; 00:17:08.140
Nikita Barysheva: And we're making Api call to the to the route, and we are handling the the response. You can handle it wherever you want. We at that moment in time decided that we want to return</p>
<p>128
00:17:08.359 --&gt; 00:17:09.180
Nikita Barysheva: to</p>
<p>129
00:17:09.440 --&gt; 00:17:25.660
Nikita Barysheva: 2 values here. The 1st one is like represents the status, if it's okay or not. And the second one is the response. So we can check if it's if the request was okay or not. And this is actually the call Api function that actually makes a request</p>
<p>130
00:17:26.160 --&gt; 00:17:44.499
Nikita Barysheva: to the service it has. Like some Retries, you can set up whatever you want, and once again you can make it better. If you want. Logging. You can make your request itself, and for sure, error, error handling with logins also.</p>
<p>131
00:17:45.580 --&gt; 00:17:50.759
Nikita Barysheva: And if any questions so far, let me know</p>
<p>132
00:17:54.020 --&gt; 00:18:01.149
Nikita Barysheva: it's this is one of the examples how a client can make a get request</p>
<p>133
00:18:01.460 --&gt; 00:18:05.160
Nikita Barysheva: to to the micro service that will then</p>
<p>134
00:18:05.320 --&gt; 00:18:08.100
Nikita Barysheva: make like, get a data from the dynamo. dB,</p>
<p>135
00:18:08.310 --&gt; 00:18:13.170
Nikita Barysheva: so let's have a look at one of the routes micro service itself.</p>
<p>136
00:18:16.340 --&gt; 00:18:20.059
Nikita Barysheva: As we know, we decided to use Dynamodb.</p>
<p>137
00:18:20.620 --&gt; 00:18:25.700
Nikita Barysheva: 1st of all, you have to create this table. I just wanted to give you some</p>
<p>138
00:18:25.910 --&gt; 00:18:29.100
Nikita Barysheva: quick overview what is included like.</p>
<p>139
00:18:29.620 --&gt; 00:18:36.320
Nikita Barysheva: you see here some params, including, like key schema, that defines the primary key.</p>
<p>140
00:18:36.680 --&gt; 00:18:40.250
Nikita Barysheva: Primary key could be also like a composite key</p>
<p>141
00:18:40.490 --&gt; 00:18:45.620
Nikita Barysheva: of 2, let's say, 2 fields, and they</p>
<p>142
00:18:45.940 --&gt; 00:18:52.720
Nikita Barysheva: into. This is what actually help us to get to get the data like quicker.</p>
<p>143
00:18:53.170 --&gt; 00:18:58.179
Nikita Barysheva: We have different attribute definitions that describes the primary key</p>
<p>144
00:18:58.901 --&gt; 00:19:07.130
Nikita Barysheva: fields. And we can also set up different global secondary indexes. One of them for me is like email index.</p>
<p>145
00:19:07.260 --&gt; 00:19:14.640
Nikita Barysheva: It can that allows us to search by email also, not only by Id, but you may have different indexes. Not only one</p>
<p>146
00:19:17.307 --&gt; 00:19:19.250
Nikita Barysheva: about the routes.</p>
<p>147
00:19:19.872 --&gt; 00:19:31.540
Nikita Barysheva: maybe it could be obvious for many of you. But actually, I was surprised when it wasn't for some of like other developers. When I talked to them.</p>
<p>148
00:19:31.670 --&gt; 00:19:41.340
Nikita Barysheva: The basic service handles all the get put post. Delete requests easily like should should do it. Okay. But</p>
<p>149
00:19:41.820 --&gt; 00:19:44.709
Nikita Barysheva: things that people are really missing like</p>
<p>150
00:19:45.650 --&gt; 00:19:51.849
Nikita Barysheva: what we we want to update many users at once. What we want to create create many users at once.</p>
<p>151
00:19:55.600 --&gt; 00:19:56.680
Nikita Barysheva: Someone called it.</p>
<p>152
00:19:57.040 --&gt; 00:20:00.379
Nikita Barysheva: So when you talk about dynamo D,</p>
<p>153
00:20:01.020 --&gt; 00:20:09.059
Nikita Barysheva: and when we talk about the cost consideration, it's much like better.</p>
<p>154
00:20:09.190 --&gt; 00:20:12.480
Nikita Barysheva: Let's say you want to create 100 users.</p>
<p>155
00:20:12.720 --&gt; 00:20:17.130
Nikita Barysheva: You have a reason for that. Let's say you don't go in a for loop</p>
<p>156
00:20:17.290 --&gt; 00:20:20.119
Nikita Barysheva: and creating like one after another.</p>
<p>157
00:20:20.290 --&gt; 00:20:35.150
Nikita Barysheva: You're sending the batch of 100 users, and they basically, this batch will be divided by 2 chunks to chunks by 25 records, and it will be like already</p>
<p>158
00:20:35.280 --&gt; 00:20:41.920
Nikita Barysheva: or requests it will be done to Dynamodb, so much, much less. Okay.</p>
<p>159
00:20:42.170 --&gt; 00:20:43.479
Gabor Szabo: There is a question.</p>
<p>160
00:20:43.770 --&gt; 00:20:44.190
Gabor Szabo: Oh.</p>
<p>161
00:20:44.190 --&gt; 00:20:44.690
Nikita Barysheva: Yep.</p>
<p>162
00:20:44.690 --&gt; 00:20:49.730
Gabor Szabo: How did you convert the data model from relational to NoSQL?</p>
<p>163
00:20:50.550 --&gt; 00:20:57.619
Nikita Barysheva: Yeah, okay, that's good. One 3D is actually so, since we know that SQL,</p>
<p>164
00:20:57.730 --&gt; 00:21:00.940
Nikita Barysheva: he's like, Hey, we have this strong structure</p>
<p>165
00:21:01.050 --&gt; 00:21:09.219
Nikita Barysheva: like a fixed structure. We now we know what to expect and basically created another dictionary.</p>
<p>166
00:21:09.330 --&gt; 00:21:16.340
Nikita Barysheva: Whoever of the user and transferred it like basically renew</p>
<p>167
00:21:17.000 --&gt; 00:21:21.549
Nikita Barysheva: what is the the scheme of the Postgresql.</p>
<p>168
00:21:21.780 --&gt; 00:21:29.169
Nikita Barysheva: we received the the users, we transform that like basic dictionary and</p>
<p>169
00:21:29.620 --&gt; 00:21:35.050
Nikita Barysheva: using the post method like with the bulk, created it in Dynamodb.</p>
<p>170
00:21:35.610 --&gt; 00:21:41.820
Nikita Barysheva: And that's that's it. No, no, not really, no, no magic over there. Actually.</p>
<p>171
00:21:43.180 --&gt; 00:21:49.429
Nikita Barysheva: As for the daytime object, maybe this this is what may be specifically interest interesting to you.</p>
<p>172
00:21:50.448 --&gt; 00:21:56.219
Nikita Barysheva: Restored date as a string, so we can convert it.</p>
<p>173
00:21:56.790 --&gt; 00:22:08.040
Nikita Barysheva: You can also store it in a milliseconds. What else we had over there. So we had booleaning like sorry Boolean Boolean fields, and</p>
<p>174
00:22:08.680 --&gt; 00:22:13.980
Nikita Barysheva: nothing really special that can change you</p>
<p>175
00:22:14.170 --&gt; 00:22:18.829
Nikita Barysheva: just converting an object that you're getting from a Postgresql</p>
<p>176
00:22:19.010 --&gt; 00:22:25.190
Nikita Barysheva: to the object that will be suitable for dynamo. dB, yeah.</p>
<p>177
00:22:26.010 --&gt; 00:22:31.520
Nikita Barysheva: and that answer your question, or it can be more specific, excellent.</p>
<p>178
00:22:35.980 --&gt; 00:22:41.230
Nikita Barysheva: I just don't see if it's a yes or no. Just sharing my screen.</p>
<p>179
00:22:41.620 --&gt; 00:22:43.050
Gabor Szabo: Yes, he says.</p>
<p>180
00:22:43.800 --&gt; 00:22:47.959
Nikita Barysheva: Okay, yeah. I see here now.</p>
<p>181
00:22:50.110 --&gt; 00:22:56.679
Nikita Barysheva: Yep. So, we talked about obvious and not obvious things. And</p>
<p>182
00:22:59.590 --&gt; 00:23:05.980
Nikita Barysheva: this is the service setup. I also tried to put many things on</p>
<p>183
00:23:06.270 --&gt; 00:23:15.079
Nikita Barysheva: one screen. Can you see it? Because when I showed it live, people struggled to see it. I just want to check.</p>
<p>184
00:23:15.080 --&gt; 00:23:20.760
Gabor Szabo: If you could enlarge the whole thing a little bit, that might be nice. I don't know if you can do that.</p>
<p>185
00:23:23.290 --&gt; 00:23:30.279
Nikita Barysheva: One second. It doesn't look like it allows it in this mode.</p>
<p>186
00:23:32.750 --&gt; 00:23:34.000
Nikita Barysheva: This doesn't help.</p>
<p>187
00:23:38.820 --&gt; 00:23:48.070
Gabor Szabo: There's also another question in the meantime: why did you need Redis, if DynamoDB has great performance on reads?</p>
<p>188
00:23:49.990 --&gt; 00:24:00.989
Nikita Barysheva: Because it's also good for saving money, actually. And the Redis part can be explained in a second.</p>
<p>189
00:24:01.400 --&gt; 00:24:06.300
Nikita Barysheva: It's a good one. Let's go over here.</p>
<p>190
00:24:14.750 --&gt; 00:24:15.500
Nikita Barysheva: Okay.</p>
<p>191
00:24:15.800 --&gt; 00:24:21.639
Nikita Barysheva: So we are now saying that we are working like this, okay:</p>
<p>192
00:24:21.800 --&gt; 00:24:31.819
Nikita Barysheva: we are going from client to DynamoDB, and I mentioned Redis here. But I also had, I think, to mention the Redis on this layer.</p>
<p>193
00:24:32.390 --&gt; 00:24:40.649
Nikita Barysheva: So what's happening right now is that our client makes another API call to another service,</p>
<p>194
00:24:41.000 --&gt; 00:24:44.359
Nikita Barysheva: which, like every call, let's say, costs us something,</p>
<p>195
00:24:44.560 --&gt; 00:24:46.810
Nikita Barysheva: and then we go to DynamoDB.</p>
<p>196
00:24:47.776 --&gt; 00:24:56.910
Nikita Barysheva: So Redis is not only about the speed, it's also about the money, the cost reduction.</p>
<p>197
00:24:57.390 --&gt; 00:25:02.470
Nikita Barysheva: And we, for example, here at this layer.</p>
<p>198
00:25:02.810 --&gt; 00:25:07.010
Nikita Barysheva: if the client was created, like,</p>
<p>199
00:25:07.310 --&gt; 00:25:13.530
Nikita Barysheva: not properly, and there are many requests, and you don't cache the requests, you don't cache the results,</p>
<p>200
00:25:13.700 --&gt; 00:25:25.129
Nikita Barysheva: this client will make another request, another request, another request, and it can grow dramatically, and you will get, like, a huge cost after all.</p>
<p>201
00:25:25.240 --&gt; 00:25:28.289
Nikita Barysheva: So Redis is the solution for that also.</p>
<p>202
00:25:29.710 --&gt; 00:25:38.919
Nikita Barysheva: Using Redis for caching basically allows you, first of all, to decrease the load</p>
<p>203
00:25:39.190 --&gt; 00:25:45.409
Nikita Barysheva: of the service and, as a result, to decrease the cost.</p>
<p>204
00:25:45.960 --&gt; 00:25:53.560
Nikita Barysheva: So one of the reasons, which not everyone thinks about in the beginning, is, like, the cost reduction.</p>
<p>205
00:25:54.890 --&gt; 00:25:58.660
Nikita Barysheva: Yep, is that okay?</p>
<p>206
00:26:03.800 --&gt; 00:26:04.800
Nikita Barysheva: Trying to</p>
<p>207
00:26:10.780 --&gt; 00:26:12.339
Nikita Barysheva: Does that answer the question?</p>
<p>208
00:26:15.790 --&gt; 00:26:23.640
Gabor Szabo: I think you can just go on. Okay, I'm just reading out: did you use Redis, or did you use AWS caching services?</p>
<p>209
00:26:24.360 --&gt; 00:26:26.529
Nikita Barysheva: Redis, Redis. We used Redis.</p>
<p>210
00:26:26.530 --&gt; 00:26:27.560
Gabor Szabo: Yeah, okay.</p>
<p>211
00:26:27.560 --&gt; 00:26:34.379
Nikita Barysheva: Because we use it widely in all projects. Yeah, we are used to Redis.</p>
<p>212
00:26:41.870 --&gt; 00:26:44.940
Gabor Szabo: I think you can go back to the code example.</p>
<p>213
00:26:50.400 --&gt; 00:26:58.774
Nikita Barysheva: Okay. Basically, the first things that you will need for the service...</p>
<p>214
00:26:59.670 --&gt; 00:27:01.669
Nikita Barysheva: It's like, the setup is...</p>
<p>215
00:27:01.820 --&gt; 00:27:05.519
Nikita Barysheva: It's pretty easy. First, there is Pydantic again.</p>
<p>216
00:27:06.150 --&gt; 00:27:10.750
Nikita Barysheva: Even though we said that DynamoDB is, like...</p>
<p>217
00:27:11.340 --&gt; 00:27:14.620
Nikita Barysheva: we don't have to follow this strict schema.</p>
<p>218
00:27:15.040 --&gt; 00:27:17.980
Nikita Barysheva: It's a very good practice to have one, I mean</p>
<p>219
00:27:18.130 --&gt; 00:27:28.449
Nikita Barysheva: in DynamoDB, when you create a record with a field like None, it won't be added, so we won't find it. If you go to the UI, you won't have it. But when you're</p>
<p>220
00:27:28.820 --&gt; 00:27:33.899
Nikita Barysheva: getting the data, when you return the data, it's very good to have</p>
<p>221
00:27:34.400 --&gt; 00:27:41.840
Nikita Barysheva: the model that you can use to serialize what you receive from DynamoDB, from the database. This will help,</p>
<p>222
00:27:42.300 --&gt; 00:27:47.980
Nikita Barysheva: like the client, to know what to expect and</p>
<p>223
00:27:49.110 --&gt; 00:27:54.539
Nikita Barysheva: avoid some unpredictable scenarios. So</p>
<p>224
00:27:54.670 --&gt; 00:27:58.719
Nikita Barysheva: the model is always good, even if you work with a</p>
<p>225
00:27:58.860 --&gt; 00:28:02.199
Nikita Barysheva: DynamoDB or the key-value database.</p>
<p>226
00:28:02.560 --&gt; 00:28:08.259
Nikita Barysheva: You see here, like, one of the examples of how you can model data.</p>
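As a sketch of such a model (the field names here are hypothetical, not the ones from the slide), a Pydantic class gives every record the same shape even when DynamoDB omitted some attributes on write:

```python
from typing import Optional
from pydantic import BaseModel

class User(BaseModel):
    id: str
    email: Optional[str] = None        # attributes that were None are absent in DynamoDB,
    first_name: Optional[str] = None   # so they fall back to these defaults on read
    is_active: Optional[bool] = None

# An item fetched from the table may be missing fields; the model fills them in,
# so every client of the service sees a predictable structure.
user = User(id="42", email="nikita@example.com")
```

Serializing through the model is what lets clients "know what to expect", as she puts it.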
<p>227
00:28:08.590 --&gt; 00:28:16.620
Nikita Barysheva: And here there is a function, set user cache, that sets a specific key-value</p>
<p>228
00:28:16.900 --&gt; 00:28:21.400
Nikita Barysheva: into the Redis with, like, an expiration time.</p>
<p>229
00:28:21.560 --&gt; 00:28:35.010
Nikita Barysheva: And you can find the except and the log error, so that every time something falls down, you will see it later; we just throw a proper log. Sorry, I need to move the bar.</p>
<p>230
00:28:36.690 --&gt; 00:28:41.280
Nikita Barysheva: Yeah, that will give you some details.</p>
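A minimal version of such a set-user-cache function might look like this; the key format and the 300-second TTL are illustrative assumptions:

```python
import json
import logging

logger = logging.getLogger("users-service")

def set_user_cache(redis_client, user: dict, ttl_seconds: int = 300) -> None:
    """Store the user in Redis under a key like 'user:<id>' with an expiration time."""
    try:
        # SETEX writes the value and its TTL atomically
        redis_client.setex(f"user:{user['id']}", ttl_seconds, json.dumps(user))
    except Exception:
        # a cache failure should never break the request, but it must show up in the logs
        logger.exception("failed to cache user %s", user.get("id"))
```

With `redis-py`, `redis_client` would be a `redis.Redis(...)` instance; the broad `except` plus `logger.exception` is exactly the "log the error, don't crash" pattern described in the talk.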
<p>231
00:28:42.850 --&gt; 00:28:49.040
Nikita Barysheva: I don't know why, but I also saw a lot of examples of people trying to avoid writing logs,</p>
<p>232
00:28:49.220 --&gt; 00:29:04.850
Nikita Barysheva: or forgetting any logs. Like, I think it's a must. You have to have it written somewhere: what was the error? And what I mentioned before is, like, the request ID.</p>
<p>233
00:29:04.950 --&gt; 00:29:19.059
Nikita Barysheva: It's very important. If you have many, many services or clients that work with the users database, and everything goes through the service like a main point, all these CRUD operations, you need to know</p>
<p>234
00:29:19.410 --&gt; 00:29:28.760
Nikita Barysheva: who made the request and when. It's very good for analytics, you can use it for graphs later on, you can use it to debug things.</p>
<p>235
00:29:29.780 --&gt; 00:29:34.309
Nikita Barysheva: Only positive things from the logs, as I see it.</p>
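One common way to get a request ID into every log line is a `LoggerAdapter`; the `X-Request-ID` header and the log format below are assumptions for illustration, not details from the talk:

```python
import logging

logging.basicConfig(format="%(levelname)s [request_id=%(request_id)s] %(message)s")
logger = logging.getLogger("users-service")

def request_logger(request_id: str) -> logging.LoggerAdapter:
    """Bind the caller's request id (e.g. from an X-Request-ID header) to every log line."""
    return logging.LoggerAdapter(logger, {"request_id": request_id})

log = request_logger("abc-123")
log.warning("user lookup failed")  # every line now carries request_id=abc-123
```

Because the ID travels with every log line, you can trace one call across services, graph it, or debug it later, which is the benefit described above.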
<p>236
00:29:34.890 --&gt; 00:29:40.377
Nikita Barysheva: And also, for sure, it can increase the costs. But</p>
<p>237
00:29:41.080 --&gt; 00:29:43.910
Nikita Barysheva: you still have to find a balance.</p>
<p>238
00:29:46.030 --&gt; 00:29:48.930
Nikita Barysheva: Second, yep.</p>
<p>239
00:29:49.290 --&gt; 00:30:12.569
Nikita Barysheva: This is one of the simple examples of how you can get the user. You can get it by ID, by email, and you can specify which fields to project. It's like SELECT in Postgres: when you do SELECT email, first name, last name, it will return you only those fields. The same you can do in DynamoDB,</p>
<p>240
00:30:13.270 --&gt; 00:30:17.819
Nikita Barysheva: and it's basically called projection attributes.</p>
<p>241
00:30:18.010 --&gt; 00:30:26.010
Nikita Barysheva: Also, we're checking if there is an ID or email provided,</p>
<p>242
00:30:26.480 --&gt; 00:30:28.410
Nikita Barysheva: because it's a very logical thing.</p>
<p>243
00:30:28.580 --&gt; 00:30:34.219
Nikita Barysheva: if none of this is provided, you don't look for the user.</p>
<p>244
00:30:34.380 --&gt; 00:30:41.199
Nikita Barysheva: Unless you have, maybe, a route that can do a conditional,</p>
<p>245
00:30:41.747 --&gt; 00:30:49.370
Nikita Barysheva: conditional search. But here I decided to show, like, the basic one. By conditional I mean, for example:</p>
<p>246
00:30:49.570 --&gt; 00:30:55.500
Nikita Barysheva: in SQL, when you look for the user,</p>
<p>247
00:30:55.660 --&gt; 00:30:59.459
Nikita Barysheva: with the name Nikita, all the users with the name Nikita.</p>
<p>248
00:30:59.660 --&gt; 00:31:01.950
Nikita Barysheva: So you don't have Id or email.</p>
<p>249
00:31:02.150 --&gt; 00:31:07.829
Nikita Barysheva: But anyway, you can do the same thing over here; I just didn't include it.</p>
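A sketch of that lookup, assuming a boto3 DynamoDB `Table` resource and a hypothetical `email-index` secondary index (the index and attribute names are not from the talk):

```python
def get_user(table, user_id=None, email=None, fields=None):
    """Fetch a user by id or email; `fields` works like SELECT col1, col2 in SQL."""
    if not user_id and not email:
        raise ValueError("provide user_id or email")  # nothing to look for otherwise
    kwargs = {}
    if fields:
        # ProjectionExpression is DynamoDB's "projection attributes"
        kwargs["ProjectionExpression"] = ", ".join(fields)
    if user_id:
        return table.get_item(Key={"id": user_id}, **kwargs).get("Item")
    resp = table.query(
        IndexName="email-index",  # hypothetical GSI on the email attribute
        KeyConditionExpression="email = :e",
        ExpressionAttributeValues={":e": email},
        **kwargs,
    )
    items = resp.get("Items", [])
    return items[0] if items else None
```

The early `ValueError` mirrors the check she describes: with neither identifier there is no point in querying at all.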
<p>250
00:31:12.668 --&gt; 00:31:18.590
Nikita Barysheva: This is the actual process. So before going to DynamoDB,</p>
<p>251
00:31:18.790 --&gt; 00:31:20.669
Nikita Barysheva: we are checking the Redis cache.</p>
<p>252
00:31:20.790 --&gt; 00:31:27.560
Nikita Barysheva: So we're checking the cache, and if there is nothing in the cache, we proceed to the actual</p>
<p>253
00:31:28.228 --&gt; 00:31:30.920
Nikita Barysheva: lookup in the table.</p>
<p>254
00:31:31.400 --&gt; 00:31:33.129
Nikita Barysheva: So here is like</p>
<p>255
00:31:33.270 --&gt; 00:31:42.380
Nikita Barysheva: users table is, like, let's say, an object initiated before that knows the function get item,</p>
<p>256
00:31:42.770 --&gt; 00:31:45.170
Nikita Barysheva: get item by key,</p>
<p>257
00:31:45.310 --&gt; 00:31:53.150
Nikita Barysheva: and if we provide the projection attributes, we say: please return us some specific fields;</p>
<p>258
00:31:53.320 --&gt; 00:31:59.249
Nikita Barysheva: or we search by email. Once again, you can optimize this code as you wish, but</p>
<p>259
00:32:00.060 --&gt; 00:32:08.269
Nikita Barysheva: again, this is just an example. And then, at the end of the day, after we found the user, if we found the user, we cache it.</p>
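Put together, the cache-aside flow described here might look like this (key names and the 300-second TTL are illustrative):

```python
import json

def fetch_user(redis_client, table, user_id):
    """Check Redis first; only on a miss go to DynamoDB, then cache the result."""
    cached = redis_client.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)  # cache hit: DynamoDB is never touched
    item = table.get_item(Key={"id": user_id}).get("Item")
    if item:
        # cache the hit so the next lookup is free
        redis_client.setex(f"user:{user_id}", 300, json.dumps(item))
    return item
```

Every hit served from Redis is a DynamoDB read you did not pay for, which is the cost argument made earlier.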
<p>260
00:32:12.675 --&gt; 00:32:16.210
Nikita Barysheva: in this specific example. I wanted to</p>
<p>261
00:32:17.120 --&gt; 00:32:23.600
Nikita Barysheva: to mention that we potentially may have 3 types of successful response over here.</p>
<p>262
00:32:25.440 --&gt; 00:32:39.880
Nikita Barysheva: We may end up in a situation when we didn't find any users according to the condition, like, according to the ID that was provided, or the email. And we return, like, an empty object here.</p>
<p>263
00:32:40.270 --&gt; 00:32:58.629
Nikita Barysheva: Or, if we provided a projection attribute, we return the data as we received it from DynamoDB. Or, if we did find the user and didn't provide any projection attributes, here we want to serialize it. As I said, we have a model,</p>
<p>264
00:32:58.750 --&gt; 00:33:01.120
Nikita Barysheva: and we want to serialize the user</p>
<p>265
00:33:01.310 --&gt; 00:33:11.326
Nikita Barysheva: All the fields that we have inside the user will be passed as they are, like, a key and its value, and those that are not will</p>
<p>266
00:33:12.490 --&gt; 00:33:18.410
Nikita Barysheva: get default values. Normally you put them as None, because</p>
<p>267
00:33:18.750 --&gt; 00:33:24.750
Nikita Barysheva: there is no reason to put something not relevant.</p>
<p>268
00:33:25.500 --&gt; 00:33:28.119
Nikita Barysheva: Depends on the business thing. But yeah.</p>
<p>269
00:33:30.190 --&gt; 00:33:34.359
Nikita Barysheva: And another thing that might be</p>
<p>270
00:33:34.780 --&gt; 00:33:38.750
Nikita Barysheva: that is also important: it's, like, error handling.</p>
<p>271
00:33:39.500 --&gt; 00:33:47.180
Nikita Barysheva: We have different types of error handling. Please don't forget about it, please use it, even though it may look overwhelming.</p>
<p>272
00:33:47.320 --&gt; 00:33:53.660
Nikita Barysheva: I find it sometimes much better to have it rather than avoiding it. And after that</p>
<p>273
00:33:53.850 --&gt; 00:34:02.910
Nikita Barysheva: something is crashing, and everyone is, like, trying to understand what the situation was. You can handle everything with a general exception, but...</p>
<p>274
00:34:04.130 --&gt; 00:34:09.680
Nikita Barysheva: if you are provided with the tools, why not use them? That's my idea.</p>
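The point about handling distinct error types instead of one blanket `except` can be sketched like this; the exception names and status mapping are illustrative, not the service's real code:

```python
import logging

logger = logging.getLogger("users-service")

class UserNotFound(Exception):
    pass

class BackendThrottled(Exception):
    pass

def handle_get(lookup, user_id):
    """Map each failure mode to its own response instead of a single generic 500."""
    try:
        return {"status": 200, "user": lookup(user_id)}
    except UserNotFound:
        return {"status": 404, "error": f"user {user_id} not found"}
    except BackendThrottled:
        logger.error("backend throttled while fetching %s", user_id)
        return {"status": 429, "error": "try again later"}
    except Exception:
        # last-resort catch: log the full traceback so debugging is possible later
        logger.exception("unexpected error fetching %s", user_id)
        return {"status": 500, "error": "internal error"}
```

With boto3, the equivalent distinctions would come from inspecting `botocore.exceptions.ClientError` codes.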
<p>275
00:34:13.100 --&gt; 00:34:18.300
Nikita Barysheva: just wanted to sum up the service it's like</p>
<p>276
00:34:18.480 --&gt; 00:34:28.170
Nikita Barysheva: intended for, and there are 2 things that you see here that I didn't mention above, but they're very important for services like that.</p>
<p>277
00:34:28.550 --&gt; 00:34:40.309
Nikita Barysheva: So the service idea is, like, handling DynamoDB requests, like all CRUD operations. It also should be able to cache things</p>
<p>278
00:34:40.730 --&gt; 00:34:43.080
Nikita Barysheva: to avoid like if</p>
<p>279
00:34:43.239 --&gt; 00:34:50.190
Nikita Barysheva: if the request already got to the service for some reason, but</p>
<p>280
00:34:50.460 --&gt; 00:34:57.029
Nikita Barysheva: you have the cached values. Still, even if the service got the request, there is no reason</p>
<p>281
00:34:57.250 --&gt; 00:34:59.759
Nikita Barysheva: to bother DynamoDB, because</p>
<p>282
00:34:59.940 --&gt; 00:35:07.190
Nikita Barysheva: after all, it's another, like, cent. It doesn't sound like a lot; another dollar, let's call it so.</p>
<p>283
00:35:07.360 --&gt; 00:35:18.809
Nikita Barysheva: But if you think about the very big scale, like when you have millions, tens of millions of users, if something won't be covered, it could cost you a lot, so</p>
<p>284
00:35:19.310 --&gt; 00:35:23.910
Nikita Barysheva: I would prefer using caching and</p>
<p>285
00:35:24.760 --&gt; 00:35:27.969
Nikita Barysheva: rather than avoiding using it. And</p>
<p>286
00:35:28.400 --&gt; 00:35:30.560
Nikita Barysheva: to save some money over here</p>
<p>287
00:35:31.805 --&gt; 00:35:37.329
Nikita Barysheva: we need to have a proper error handler. That's why I mentioned 4 of them,</p>
<p>288
00:35:37.730 --&gt; 00:35:42.989
Nikita Barysheva: and maybe someone won't like it, but I did it. And the 2 things that I didn't mention here are, like:</p>
<p>289
00:35:43.410 --&gt; 00:35:47.720
Nikita Barysheva: you need to have a throttling mechanism and a rate limiter. Actually,</p>
<p>290
00:35:48.970 --&gt; 00:35:53.220
Nikita Barysheva: it depends, I mean, it could be done on the</p>
<p>291
00:35:53.900 --&gt; 00:36:02.979
Nikita Barysheva: it should be done also on the service side, because the service is, let's say, a standalone thing. But you also may think about, like,</p>
<p>292
00:36:03.150 --&gt; 00:36:05.360
Nikita Barysheva: throttling on the clients. So I mean,</p>
<p>293
00:36:06.180 --&gt; 00:36:16.179
Nikita Barysheva: throttling for sure, and a rate limiter, since we're talking about the service already. These 2 things are very important to have in services like that.</p>
<p>294
00:36:16.340 --&gt; 00:36:18.929
Nikita Barysheva: So don't forget to cover it.</p>
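One minimal shape for the throttling mechanism mentioned is a token bucket; the rate and capacity numbers here are arbitrary, and a real service might use a library or an API-gateway feature instead:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens for the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off, e.g. answer HTTP 429
```

The same structure works on the client side (to avoid hammering the service) and on the service side (to protect DynamoDB), which matches both placements discussed above.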
<p>295
00:36:19.090 --&gt; 00:36:25.790
Nikita Barysheva: And then, I think, that's actually it.</p>
<p>296
00:36:25.990 --&gt; 00:36:26.790
Nikita Barysheva: Hey?</p>
<p>297
00:36:27.190 --&gt; 00:36:29.089
Nikita Barysheva: Yeah, that's it.</p>
<p>298
00:36:29.270 --&gt; 00:36:32.549
Nikita Barysheva: And I think you did it faster, you know</p>
<p>299
00:36:33.410 --&gt; 00:36:39.910
Nikita Barysheva: if you have any questions, just let me know. I would be happy to answer them again, if not.</p>
<p>300
00:36:40.100 --&gt; 00:36:44.509
Nikita Barysheva: thanks for listening. I hope it was interesting to you, and</p>
<p>301
00:36:44.830 --&gt; 00:36:47.499
Nikita Barysheva: it will give you some ideas, maybe, or</p>
<p>302
00:36:47.940 --&gt; 00:36:50.580
Nikita Barysheva: you will decide to do something similar to this.</p>
<p>303
00:36:52.160 --&gt; 00:36:53.080
Nikita Barysheva: Just let me know.</p>
<p>304
00:36:54.640 --&gt; 00:36:55.580
Gabor Szabo: Well.</p>
<p>305
00:36:55.890 --&gt; 00:37:05.981
Gabor Szabo: thank you for the presentation. If anyone has any more questions, that would be a good time to ask now. If not, then...</p>
<p>306
00:37:08.180 --&gt; 00:37:16.060
Gabor Szabo: thank you very much for giving this presentation and for being here, and to those who were watching it live.</p>
<p>307
00:37:16.190 --&gt; 00:37:23.169
Gabor Szabo: Now you have the chance. I'm telling it also to the viewers on the YouTube channel: those people who are here</p>
<p>308
00:37:23.820 --&gt; 00:37:33.870
Gabor Szabo: in the live meeting can stay on, and after we stop the recording we can open the mics, and then we can have a conversation,</p>
<p>309
00:37:33.990 --&gt; 00:37:51.040
Gabor Szabo: asking all kinds of other questions that you might not have wanted to be on the video. So anyway, thank you for being here, thank you for watching, and don't forget to like the video and follow the channel, and see you next time.</p>
<p>310
00:37:51.220 --&gt; 00:37:55.589
Gabor Szabo: Join the Meetup group if you're not there yet, and thank you.</p>
<p>311
00:37:56.960 --&gt; 00:37:58.029
Nikita Barysheva: Thank you. Everyone.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Using Streamlit to create interactive web apps and deploy machine learning models with Leah Levy</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-03-13T07:30:01Z</updated>
    <pubDate>2025-03-13T07:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/using-streamlit" />
    <id>https://python.code-maven.com/using-streamlit</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/cnaNvhuolBs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>Discover how to quickly turn your Python scripts into interactive web apps using <a href="https://streamlit.io/">Streamlit</a>. This session will cover key features like visualisations, widgets, and deployment, empowering you to create user-friendly interfaces with minimal effort.</p>
<ul>
<li><a href="https://www.linkedin.com/in/llevy1/">Leah Levy on LinkedIn</a></li>
<li><a href="https://github.com/LLevy1/">Leah Levy on GitHub</a></li>
</ul>
<p><img src="images/leah-levy.jpeg" alt="Leah Levy" /></p>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.390 --&gt; 00:00:05.820
Gabor Szabo: So hello and welcome to the Code Maven Channel.</p>
<p>2
00:00:05.960 --&gt; 00:00:14.180
Gabor Szabo: My name is Gabor. I organize these events because I think it's very important for people to be able to share their knowledge.</p>
<p>3
00:00:14.410 --&gt; 00:00:38.479
Gabor Szabo: and it's very useful for everyone else to learn from other people all around the world. I myself usually teach Python and Rust and help companies introduce testing in these 2 languages, or introduce these languages. And that's it, basically. This channel is now mostly these videos from these meetings,</p>
<p>4
00:00:38.860 --&gt; 00:01:07.330
Gabor Szabo: and I am really happy that you agreed to give this presentation in our meeting, and thank you everyone for joining us here. If you are in the Zoom Meeting. Then feel free to ask questions. Just remember that it's going to be in Youtube. If you're watching it in Youtube, then. And if you enjoy this video, then please, like the video and follow the channel, and later on we'll have below the the video links</p>
<p>5
00:01:07.380 --&gt; 00:01:14.970
Gabor Szabo: where you can contact Leah as well if you are interested later on. So now it's your turn, go ahead.</p>
<p>6
00:01:15.950 --&gt; 00:01:16.850
Gabor Szabo: Welcome now.</p>
<p>7
00:01:21.690 --&gt; 00:01:30.810
Leah Levy: so hopefully you can see my screen. So my name is Leah. I'm currently living in the UK; I'm a data scientist in the UK.</p>
<p>8
00:01:30.810 --&gt; 00:01:41.190
Gabor Szabo: Maybe it's just me, but I can see the list of the people who joined. Is it on your screen, or is it just mine? No, I think you're sharing that one.</p>
<p>9
00:01:43.930 --&gt; 00:01:44.700
Leah Levy: Yeah.</p>
<p>10
00:01:49.860 --&gt; 00:01:53.950
Gabor Szabo: Wait a second. Maybe it's my, it's mine. No view.</p>
<p>11
00:01:54.720 --&gt; 00:01:56.819
Gabor Szabo: Yeah, no, it was mine. Sorry.</p>
<p>12
00:01:58.430 --&gt; 00:02:00.800
Gabor Szabo: Sorry for confusing you. Okay.</p>
<p>13
00:02:02.020 --&gt; 00:02:02.550
Leah Levy: It's okay.</p>
<p>14
00:02:02.550 --&gt; 00:02:06.490
Gabor Szabo: Go ahead. No, no, it's okay. It was on my screen in the.</p>
<p>15
00:02:11.030 --&gt; 00:02:22.899
Leah Levy: yeah. So I'm a data scientist in the for the Uk government. I'm currently get living in in England. I'm hoping to move to Israel soon. So be nice to meet everybody.</p>
<p>16
00:02:23.611 --&gt; 00:02:42.418
Leah Levy: I'm gonna talk today about Streamlit, which is a Python library, and how I use it to, like, deploy machine learning models and just build web apps. I'll put my contact details in the chat if you wanna connect with me on LinkedIn or follow me on GitHub.</p>
<p>17
00:02:43.200 --&gt; 00:02:45.630
Leah Levy: It would be great to connect,</p>
<p>18
00:02:46.694 --&gt; 00:02:55.610
Leah Levy: and please feel free to ask questions as we go along. I can see the chat, so if you want to put messages in the chat or come off mute, whatever you prefer.</p>
<p>19
00:02:56.760 --&gt; 00:03:02.790
Leah Levy: So Streamlit is a Python library. It's open source.</p>
<p>20
00:03:02.790 --&gt; 00:03:10.029
Gabor Szabo: Sorry, just one note. Right now we can see both you and the slides,</p>
<p>21
00:03:10.300 --&gt; 00:03:11.630
Gabor Szabo: and.</p>
<p>22
00:03:11.900 --&gt; 00:03:12.880
Leah Levy: Oh, okay.</p>
<p>23
00:03:12.880 --&gt; 00:03:22.350
Gabor Szabo: So maybe you want to turn off your camera, or just show the slides, because in the recording you will be seen anyway, probably at the top right corner.</p>
<p>24
00:03:23.000 --&gt; 00:03:25.769
Gabor Szabo: that now you can. I can see myself.</p>
<p>25
00:03:26.950 --&gt; 00:03:28.720
Leah Levy: I'll share again. Hold on.</p>
<p>26
00:03:29.060 --&gt; 00:03:29.850
Gabor Szabo: Okay.</p>
<p>27
00:03:36.170 --&gt; 00:03:40.759
Leah Levy: Oh yeah, it was on a strange setting. I think I was messing around with the settings before.</p>
<p>28
00:03:40.960 --&gt; 00:03:41.710
Leah Levy: Okay.</p>
<p>29
00:03:46.550 --&gt; 00:03:47.979
Gabor Szabo: Oh, now it's good!</p>
<p>30
00:03:48.590 --&gt; 00:03:49.296
Leah Levy: Yeah, okay.</p>
<p>31
00:03:50.100 --&gt; 00:03:50.510
Gabor Szabo: Okay.</p>
<p>32
00:03:51.450 --&gt; 00:03:56.100
Leah Levy: Thanks for letting me know. So you can see just like there's the slideshow.</p>
<p>33
00:03:57.130 --&gt; 00:03:57.710
Gabor Szabo: Yeah.</p>
<p>34
00:03:58.130 --&gt; 00:03:58.790
Leah Levy: Yeah,</p>
<p>35
00:04:02.210 --&gt; 00:04:22.959
Leah Levy: so how many of you ever perhaps worked on a data science project? You've built a machine learning model. And you've wished you could deploy it quickly for others to use. Or perhaps you've built a web application. But front end development isn't really your expertise. It's too complicated. So this is where stream it really comes into its own.</p>
<p>36
00:04:23.270 --&gt; 00:04:41.560
Leah Levy: It makes it easy for Python developers and data scientists to create beautiful interactive web apps without needing any front-end development expertise. It's lightweight, it's really easy to use, and it doesn't require, you know, hundreds of lines of code.</p>
<p>37
00:04:41.620 --&gt; 00:04:56.920
Leah Levy: And there's a really strong community online. So there's people building like add ons constantly. And there's also a strong community of people happy to answer questions and help if you have any issues.</p>
<p>38
00:05:01.950 --&gt; 00:05:10.450
Leah Levy: So Streamlit allows you to turn your Python scripts into interactive web applications in just a few lines of code. So you don't need to know any, like,</p>
<p>39
00:05:10.620 --&gt; 00:05:19.650
Leah Levy: traditional web frameworks like Flask or Django. You don't need any HTML, CSS or JavaScript. It's all Python.</p>
<p>40
00:05:20.640 --&gt; 00:05:32.522
Leah Levy: You can easily customize your web application using, like, sliders, buttons, checkboxes, making it interactive, and you're able to capture user input too.</p>
<p>41
00:05:34.180 --&gt; 00:05:53.920
Leah Levy: The app automatically updates: when you're coding in whatever IDE you prefer, like Visual Studio Code, as soon as you update the code and save it, it updates in the actual application. I'll do a demo of it a bit later, so you can see exactly what I mean.</p>
<p>42
00:05:55.020 --&gt; 00:06:00.160
Leah Levy: And that just, like, makes development much faster, so you can see your changes as you go along.</p>
<p>43
00:06:00.400 --&gt; 00:06:05.189
Leah Levy: And it works really well with other Python libraries, popular ones like</p>
<p>44
00:06:05.370 --&gt; 00:06:11.689
Leah Levy: NumPy, pandas, Plotly, even data science ones like TensorFlow and scikit-learn.</p>
<p>45
00:06:11.900 --&gt; 00:06:18.939
Leah Levy: So it enables you to visualize data: you can build dashboards, graphs, charts, and also</p>
<p>46
00:06:19.470 --&gt; 00:06:23.439
Leah Levy: integrate machine learning models directly into your application.</p>
<p>47
00:06:26.840 --&gt; 00:06:51.719
Leah Levy: So, a bit about deploying machine learning models. Often in data science you put a lot of work into creating a model: you've got your data, you've cleaned it, you've built a model, you've tested it, optimized it, you've evaluated the performance. But the real key is to, kind of, surface that to your end users or your clients,</p>
<p>48
00:06:52.170 --&gt; 00:07:01.079
Leah Levy: and using Streamlit makes it easy. It has quite a user-friendly interface, and it can handle resource-intensive tasks.</p>
<p>49
00:07:01.690 --&gt; 00:07:03.910
Leah Levy: And it's easy to deploy as well.</p>
<p>50
00:07:04.050 --&gt; 00:07:14.830
Leah Levy: A basic workflow could be something like loading a pre-trained model from a pickle file, or something from Hugging Face or TensorFlow,</p>
<p>51
00:07:16.480 --&gt; 00:07:32.710
Leah Levy: collecting input from users, so they could enter some text if it's, like, a chatbot, or they could use some sliders; and then it uses the machine learning model to make predictions and display the results to users.</p>
<p>52
00:07:33.150 --&gt; 00:07:39.510
Leah Levy: So I've created a couple of examples of</p>
<p>53
00:07:39.610 --&gt; 00:07:46.359
Leah Levy: what it can do, just, like, kind of basic: one's a dashboard, and one uses a pre-trained machine learning model.</p>
<p>54
00:07:50.010 --&gt; 00:07:59.530
Leah Levy: I've taken some screenshots, but I think it'd be better to just show it live. So I'm just gonna have a go showing it. Like, can you see this?</p>
<p>55
00:08:00.660 --&gt; 00:08:01.400
Gabor Szabo: Then like.</p>
<p>56
00:08:03.980 --&gt; 00:08:06.170
Leah Levy: Because, yeah, the code.</p>
<p>57
00:08:06.620 --&gt; 00:08:07.280
Gabor Szabo: Yes.</p>
<p>58
00:08:09.100 --&gt; 00:08:14.939
Leah Levy: So I've just pre-built, like, this very basic dashboard.</p>
<p>59
00:08:15.070 --&gt; 00:08:17.750
Leah Levy: What it does is</p>
<p>60
00:08:18.230 --&gt; 00:08:24.410
Leah Levy: I've got some dummy data about British culture. I thought I'd make it relevant to me,</p>
<p>61
00:08:25.030 --&gt; 00:08:27.209
Leah Levy: and I've just put it into a.</p>
<p>62
00:08:27.210 --&gt; 00:08:29.650
Gabor Szabo: Sorry, maybe you can enlarge the fonts a little bit.</p>
<p>63
00:08:32.220 --&gt; 00:08:32.970
Leah Levy: Yeah, let me.</p>
<p>64
00:08:33.276 --&gt; 00:08:33.889
Gabor Szabo: Yeah. Thanks.</p>
<p>65
00:08:35.520 --&gt; 00:08:36.020
Gabor Szabo: Think so.</p>
<p>66
00:08:38.250 --&gt; 00:08:38.909
Gabor Szabo: Noon.</p>
<p>67
00:08:42.010 --&gt; 00:08:43.020
Gabor Szabo: Okay, well.</p>
<p>68
00:08:43.020 --&gt; 00:08:43.420
Leah Levy: Oh, 2.</p>
<p>69
00:08:43.429 --&gt; 00:08:47.150
Gabor Szabo: Yeah, yeah, no, it's good. I see.</p>
<p>70
00:08:48.430 --&gt; 00:08:49.330
Leah Levy: Pardon.</p>
<p>71
00:08:50.920 --&gt; 00:08:51.859
Gabor Szabo: I think it's fine now.</p>
<p>72
00:08:52.500 --&gt; 00:08:53.440
Leah Levy: Okay?</p>
<p>73
00:08:54.357 --&gt; 00:09:02.049
Leah Levy: So in the terminal I just use the command streamlit run. So I do streamlit</p>
<p>74
00:09:02.210 --&gt; 00:09:06.640
Leah Levy: run, and then the name of your file.</p>
<p>75
00:09:06.830 --&gt; 00:09:13.420
Leah Levy: In this case it's in the app folder, and it's called English chat, Hi.</p>
<p>76
00:09:16.530 --&gt; 00:09:21.744
Leah Levy: and it takes a couple of seconds and it should pop up in like your browser.</p>
<p>77
00:09:23.780 --&gt; 00:09:28.030
Leah Levy: so here you have your Streamlit app in your browser; it's popped up here,</p>
<p>78
00:09:29.430 --&gt; 00:09:37.429
Leah Levy: and here's the very basic app that I built. In the top right-hand corner you see it running,</p>
<p>79
00:09:38.360 --&gt; 00:09:42.490
Leah Levy: and then there's an option here to deploy, if you're ready to deploy it.</p>
<p>80
00:09:43.405 --&gt; 00:09:45.339
Leah Levy: Oh, what's this?</p>
<p>81
00:09:48.310 --&gt; 00:09:49.360
Leah Levy: Okay?</p>
<p>82
00:09:56.720 --&gt; 00:10:03.289
Leah Levy: If this doesn't work, I will just show you the screenshot instead.</p>
<p>83
00:10:03.890 --&gt; 00:10:05.980
Leah Levy: Okay, so I've saved it here.</p>
<p>84
00:10:06.750 --&gt; 00:10:10.696
Leah Levy: And you'll see an example now, actually, of</p>
<p>85
00:10:12.450 --&gt; 00:10:23.590
Leah Levy: of how it updates in real time. So I've updated the file, the source file, and you see in the top right-hand corner now there's an option. I'll just zoom in and make it a bit bigger.</p>
<p>86
00:10:25.070 --&gt; 00:10:28.779
Leah Levy: It says source file changed, and it gives you the option to rerun,</p>
<p>87
00:10:29.161 --&gt; 00:10:33.799
Leah Levy: and you can click always rerun, so I don't have to click that every time. So if I try that...</p>
<p>88
00:10:34.150 --&gt; 00:10:38.630
Leah Levy: and it's working now. So this is just, like, a</p>
<p>89
00:10:39.100 --&gt; 00:10:46.520
Leah Levy: basic application. There's a dropdown menu here, so you can select the category. If I wanted to just see landmarks, see that...</p>
<p>90
00:10:46.830 --&gt; 00:10:50.740
Leah Levy: For some reason it's giving me an error. Sports...</p>
<p>91
00:10:54.010 --&gt; 00:11:03.280
Leah Levy: and the size of each bubble is the size of visitors per year, and you can hover over, and it gives you a little bit more information. And then if</p>
<p>92
00:11:04.900 --&gt; 00:11:12.589
Leah Levy: yeah, I think the map plot is a little bit broken at the bottom. So that's one example. The next</p>
<p>93
00:11:13.270 --&gt; 00:11:22.030
Leah Levy: application. Let me just cancel this, I'll just do Ctrl-C. Let's run another,</p>
<p>94
00:11:23.102 --&gt; 00:11:30.350
Leah Levy: another one. This is more of like a machine learning one. So I just run streamlit run and</p>
<p>95
00:11:31.820 --&gt; 00:11:32.980
Leah Levy: spell check.</p>
<p>96
00:11:50.550 --&gt; 00:11:54.489
Leah Levy: Oh, I know why it's giving me an error because I haven't installed the packages.</p>
<p>97
00:12:09.660 --&gt; 00:12:13.009
Leah Levy: I'm actually just using poetry library, which</p>
<p>98
00:12:13.200 --&gt; 00:12:34.820
Leah Levy: it's it's not sure how common, how widely it's used. But it's a 3rd party. It's like A, it's not an inbuilt typically, you might manage your libraries, use your dependencies using like requirements, dot text file and then have a virtual create a virtual environment. But I'm just.</p>
<p>99
00:12:35.420 --&gt; 00:12:45.569
Leah Levy: I've got used to using Poetry, which is another dependency manager. So, and that's just</p>
<p>100
00:12:46.130 --&gt; 00:12:48.370
Leah Levy: just to clarify exactly what it is.</p>
<p>101
00:12:51.940 --&gt; 00:12:58.580
Leah Levy: Yeah, that's not working. So let me just show you on the on the slide show.</p>
<p>102
00:12:59.580 --&gt; 00:13:00.750
Leah Levy: Sorry?</p>
<p>103
00:13:09.314 --&gt; 00:13:18.655
Leah Levy: What this is, is: it imports TextBlob, which is a very lightweight kind of natural language processing library,</p>
<p>104
00:13:20.020 --&gt; 00:13:26.119
Leah Levy: and what happens is you put in your spelling. So you put in some text. In this case</p>
<p>105
00:13:26.530 --&gt; 00:13:35.059
Leah Levy: I'm so bad at spelling. I spell really wrong, and then it returns the correct spelling, and then in the top right you can see it's very kind of</p>
<p>106
00:13:35.320 --&gt; 00:13:47.810
Leah Levy: simple. There's only like 16 lines of code, it's quite lightweight. And I've put a link here to more community projects you can see on the Streamlit website.</p>
<p>107
00:13:48.440 --&gt; 00:13:50.030
Leah Levy: they've actually got</p>
<p>108
00:13:51.100 --&gt; 00:13:59.750
Leah Levy: community projects. You can kind of get an idea of flavor, of exactly what's possible. So this one's quite cool. This is like a map.</p>
<p>109
00:14:00.445 --&gt; 00:14:06.500
Leah Levy: Application that somebody's built that's called pretty map, where you kind of visualize</p>
<p>110
00:14:07.361 --&gt; 00:14:11.959
Leah Levy: maps in like different, cool, different, cool ways.</p>
<p>111
00:14:13.051 --&gt; 00:14:22.290
Leah Levy: But just so you can get kind of get an idea of like, it's quite personalizable. It doesn't have to look like they did. All the applications don't necessarily have to look the same.</p>
<p>112
00:14:38.920 --&gt; 00:14:40.470
Leah Levy: Sorry gone too far.</p>
<p>113
00:14:45.890 --&gt; 00:14:53.241
Leah Levy: Okay. So I wanted to talk about deployment. So I mentioned. It's there's different options to deploy.</p>
<p>114
00:14:54.210 --&gt; 00:14:59.230
Leah Levy: Just gonna wait for the slides to kind of sync.</p>
<p>115
00:15:07.560 --&gt; 00:15:08.619
Leah Levy: Not sure.</p>
<p>116
00:15:09.560 --&gt; 00:15:10.799
Leah Levy: Okay, there we go.</p>
<p>117
00:15:13.880 --&gt; 00:15:27.930
Leah Levy: There's a couple of different options. You could deploy locally, which is kind of what we've done just before, when we did the streamlit run. But in most cases you want to deploy it to a cloud or servers.</p>
<p>118
00:15:28.370 --&gt; 00:15:31.159
Leah Levy: So Streamlit has its own kind of</p>
<p>119
00:15:31.370 --&gt; 00:15:39.799
Leah Levy: built like customized deployment option called the stream community cloud where you can deploy from, get straight from Github.</p>
<p>120
00:15:40.551 --&gt; 00:15:46.568
Leah Levy: But it also supports other deployment options like Docker, AWS,</p>
<p>121
00:15:48.475 --&gt; 00:15:53.880
Leah Levy: and all these other options. The another benefit of the community cloud is.</p>
<p>122
00:15:54.720 --&gt; 00:16:12.700
Leah Levy: you can it provides you with analytics data. So how many people have clicked on on your onto your dashboard. Total viewers, most recent viewers, timestamps of people's last visit. So you can kind of get an idea of when people have have used your application.</p>
<p>123
00:16:14.520 --&gt; 00:16:18.800
Leah Levy: So I want to talk about the testing framework in the app.</p>
<p>124
00:16:18.910 --&gt; 00:16:21.500
Leah Levy: This is something.</p>
<p>125
00:16:22.090 --&gt; 00:16:35.319
Leah Levy: Last time I gave this talk, at PyWeb in Tel Aviv, someone asked me about testing, and I thought, oh yeah, I've not really used the testing framework. So I thought I'd put a section in here to show you kind of how I've done it.</p>
<p>126
00:16:36.415 --&gt; 00:16:58.584
Leah Levy: So you can use pytest and those usual kinds of testing frameworks, and Streamlit also has its own framework, which enables developers to build and run headless tests that execute the app code directly, so it simulates user input and inspects the output for correctness.</p>
<p>127
00:16:59.090 --&gt; 00:17:07.560
Leah Levy: For those who don't know, headless testing is a way to run automated browser tests without having the user interface.</p>
<p>128
00:17:08.027 --&gt; 00:17:13.299
Leah Levy: So it's a more efficient way of testing the application because it doesn't need to like render the HTML.</p>
<p>129
00:17:13.569 --&gt; 00:17:27.959
Leah Levy: It just sends requests to the server the same way you would do in a browser, and it's much faster, because you don't need to wait for a page to load, and it integrates well into any CI/CD pipelines you might have as well.</p>
<p>130
00:17:29.670 --&gt; 00:17:47.450
Leah Levy: So, an example of testing. On the left-hand side I've written what might be a more traditional way to write a test. So you would import streamlit and also import textblob, which is the library I mentioned before that we used for the spell checker.</p>
<p>131
00:17:47.660 --&gt; 00:17:49.590
Leah Levy: You kind of set up a</p>
<p>132
00:17:50.100 --&gt; 00:17:57.630
Leah Levy: set up the app just as it appears, to kind of mirror what you've written,</p>
<p>133
00:17:58.258 --&gt; 00:18:07.070
Leah Levy: and have some simulated user input, and then load the TextBlob and then run the</p>
<p>134
00:18:07.520 --&gt; 00:18:15.440
Leah Levy: run the TextBlob library to generate the correct spelling, and then have an assert to ensure that</p>
<p>135
00:18:15.740 --&gt; 00:18:23.610
Leah Levy: the output is what you've expected, that it should be the corrected spelling of what you've inputted.</p>
<p>136
00:18:24.489 --&gt; 00:18:32.130
Leah Levy: But on the right, all you need to do is install the Streamlit testing framework,</p>
<p>137
00:18:32.250 --&gt; 00:18:45.980
Leah Levy: with AppTest. AppTest is what simulates the running of the app, and it provides different methods to set up, manipulate and inspect the app via the API instead of doing it in the browser.</p>
<p>138
00:18:49.370 --&gt; 00:18:57.074
Leah Levy: And then I've just written a function to test the spelling. So you've got AppTest, which runs the</p>
<p>139
00:18:57.710 --&gt; 00:19:03.239
Leah Levy: which runs the application as if I was running it in the terminal.</p>
<p>140
00:19:03.950 --&gt; 00:19:09.750
Leah Levy: It simulates an input of the incorrect spelling and runs that,</p>
<p>141
00:19:10.520 --&gt; 00:19:16.360
Leah Levy: and then asserts that the corrected text equals the correct spelling.</p>
<p>142
00:19:17.358 --&gt; 00:19:25.871
Leah Levy: And then I've just written some a couple of other tests this next function just asserts that the</p>
<p>143
00:19:27.180 --&gt; 00:19:33.809
Leah Levy: the application is running and not producing any exception errors. And then this one tests that the title is</p>
<p>144
00:19:33.990 --&gt; 00:19:36.970
Leah Levy: displaying the correct title as we've expected.</p>
<p>145
00:19:39.550 --&gt; 00:19:48.459
Leah Levy: so you'll see it's much quicker, it's fewer lines of code, and you can just run it in the terminal using pytest,</p>
<p>146
00:19:48.680 --&gt; 00:19:51.339
Leah Levy: as you would like any other testing.</p>
<p>147
00:19:54.660 --&gt; 00:20:03.330
Leah Levy: You can add multiple pages to an app. So you create a new pages folder in the same folder where your application is running,</p>
<p>148
00:20:03.934 --&gt; 00:20:15.910
Leah Levy: and then whatever you name the file is what appears on the sidebar, and you can amend the</p>
<p>149
00:20:17.040 --&gt; 00:20:23.254
Leah Levy: you can amend the content as you would in any other application. I've put a link in here,</p>
<p>150
00:20:24.030 --&gt; 00:20:25.680
Leah Levy: just so you can kind of</p>
<p>151
00:20:27.610 --&gt; 00:20:30.949
Leah Levy: I was gonna show how to</p>
<p>152
00:20:32.609 --&gt; 00:20:36.229
Leah Levy: it gives a good example, rather than me</p>
<p>153
00:20:36.680 --&gt; 00:20:41.279
Leah Levy: setting up lots of different ones. But you can kind of see the from the. It's got a good</p>
<p>154
00:20:41.750 --&gt; 00:20:44.358
Leah Levy: kind of demo page.</p>
<p>155
00:20:49.446 --&gt; 00:20:53.703
Leah Levy: hey? It's got a hello page. It's got a plotting demo.</p>
<p>156
00:20:54.980 --&gt; 00:20:58.089
Leah Levy: yeah, you can have a look in your own time if you like.</p>
<p>157
00:21:24.610 --&gt; 00:21:27.299
Leah Levy: Sorry. My computer's running super slow.</p>
<p>158
00:21:30.410 --&gt; 00:21:32.449
Gabor Szabo: So I just I was just saying.</p>
<p>159
00:21:33.350 --&gt; 00:21:38.320
Leah Levy: It also supports chat inputs. So, oops.</p>
<p>160
00:21:38.920 --&gt; 00:21:47.796
Leah Levy: So if you if you everybody wants to build their own chat bots nowadays, and it provides support for that</p>
<p>161
00:21:48.380 --&gt; 00:21:55.700
Leah Levy: where it kind of mimics a user. And it's got like an assistant with these like different emojis</p>
<p>162
00:21:56.242 --&gt; 00:22:02.159
Leah Levy: so as if you were speaking to a person. Similar to kind of.</p>
<p>163
00:22:02.720 --&gt; 00:22:07.300
Leah Levy: you know, like ChatGPT's got an assistant kind of answer.</p>
<p>164
00:22:07.560 --&gt; 00:22:30.921
Leah Levy: You can also stream the reply, you know how ChatGPT kind of streams it, or writes it word by word, instead of just giving you an answer right away, to make it look like somebody's typing. You can add a delay as well, of like a couple of seconds, to make it seem like it's thinking about a reply.</p>
<p>165
00:22:32.280 --&gt; 00:22:52.160
Leah Levy: And different things like that. So this is just an echo bot, which just echoes whatever you type into it. Obviously it's not using any large language models, but you can use kind of any large language model that you want, and just plug it in to a Streamlit dashboard.</p>
<p>166
00:23:01.040 --&gt; 00:23:04.700
Leah Levy: So finally, just some additional features</p>
<p>167
00:23:05.710 --&gt; 00:23:16.739
Leah Levy: which I've oops added kind of some links to. So, as I mentioned before, it's got like a whole wide range of different input widgets. And</p>
<p>168
00:23:17.180 --&gt; 00:23:32.760
Leah Levy: I didn't kind of include them all on the dashboard, because I think that this page actually does it in a nicer way. You can see it's got different buttons, check boxes, feedback options, radio buttons.</p>
<p>169
00:23:33.550 --&gt; 00:23:35.240
Leah Levy: sliders.</p>
<p>170
00:23:35.966 --&gt; 00:23:39.269
Leah Levy: Numeric inputs. Yeah, I could just go on, but</p>
<p>171
00:23:40.150 --&gt; 00:23:49.400
Leah Levy: pretty much you know anything you would need to build a nice looking app. It's got another</p>
<p>172
00:23:49.840 --&gt; 00:23:56.568
Leah Levy: another thing is status elements of like progress bars loading</p>
<p>173
00:23:58.890 --&gt; 00:24:03.326
Leah Levy: call out messages, but error boxes I've used before.</p>
<p>174
00:24:04.080 --&gt; 00:24:08.824
Leah Levy: I can't say I've used the balloon ones, but that looks fun</p>
<p>175
00:24:12.470 --&gt; 00:24:20.803
Leah Levy: And it also has integration for like interactive maps, as we saw before, like that, the map application that I</p>
<p>176
00:24:21.340 --&gt; 00:24:27.209
Leah Levy: And it's also you can build interactive charts with like plotly and other similar libraries.</p>
<p>177
00:24:27.640 --&gt; 00:24:36.139
Leah Levy: You can cache large data sets. So particularly when you're working with machine learning models. You're often dealing with</p>
<p>178
00:24:36.250 --&gt; 00:24:48.150
Leah Levy: lot really, really, large data sets which you can cache into memory. So rather than reloading the reloading like a data set each time it can just store it in memory.</p>
<p>179
00:24:50.161 --&gt; 00:25:12.448
Leah Levy: From a safety point of view, I've just looked at the privacy policy and took this fourth bullet point straight from the privacy policy, which is: Streamlit cannot see and does not store any information contained inside Streamlit apps, like text, charts and images. But as general advice, I would say, not to expose sensitive data,</p>
<p>180
00:25:13.020 --&gt; 00:25:17.580
Leah Levy: unless you yeah.</p>
<p>181
00:25:18.310 --&gt; 00:25:40.254
Leah Levy: you can expect, unless you're kind of like it's locked down. It's in a safe, secure environment. And you've got like full access controls and ensure your app is also protected from malicious input, like sequel injections, because, you know any. Any application is susceptible to to being hacked. So I guess just</p>
<p>182
00:25:41.480 --&gt; 00:25:48.060
Leah Levy: be wary; this is probably no different when it comes to malicious input like that.</p>
<p>183
00:25:52.590 --&gt; 00:25:53.465
Leah Levy: But</p>
<p>184
00:25:54.630 --&gt; 00:26:01.731
Leah Levy: yeah, that's all I prepared for now, but happy to answer questions and go into into more detail on different bits.</p>
<p>185
00:26:03.040 --&gt; 00:26:07.319
Leah Levy: but thank you for your time, and happy to answer any questions.</p>
<p>186
00:26:12.910 --&gt; 00:26:15.524
Gabor Szabo: So thank you for the presentation.</p>
<p>187
00:26:17.190 --&gt; 00:26:25.759
Gabor Szabo: I heard it the second time. I really like the testing part. I always think about testing when I, whatever I try to show.</p>
<p>188
00:26:25.890 --&gt; 00:26:26.970
Gabor Szabo: And</p>
<p>189
00:26:27.810 --&gt; 00:26:38.989
Gabor Szabo: if anyone has questions, then please ask. Now we can also, after the recording, after we stop the recording, we can stay around and have a conversation without the recording.</p>
<p>190
00:26:39.240 --&gt; 00:26:45.520
Gabor Szabo: But anyway, it seems that there are no questions now.</p>
<p>191
00:26:46.440 --&gt; 00:26:50.600
Gabor Szabo: So, Leah, thank you very much for for this presentation.</p>
<p>192
00:26:50.780 --&gt; 00:26:56.499
Gabor Szabo: If you'd like to add anything more, I mean, I'll have the links below the video as well.</p>
<p>193
00:26:59.180 --&gt; 00:27:05.545
Gabor Szabo: So thank you for for giving this presentation. And thank you. Thank you. Thanks. Everyone who was attending. And</p>
<p>194
00:27:06.420 --&gt; 00:27:11.800
Gabor Szabo: and everyone who was watching. So please remember to like the video and follow the Channel and see you</p>
<p>195
00:27:11.980 --&gt; 00:27:15.530
Gabor Szabo: at one of our next one of our upcoming events.</p>
<p>196
00:27:15.960 --&gt; 00:27:16.850
Gabor Szabo: Bye, bye.</p>
<p>197
00:27:18.140 --&gt; 00:27:19.260
Leah Levy: Thanks, bye.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>daffodil, data frames for optimized data inspection and logical processing with Ray Lutz</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-03-06T16:30:01Z</updated>
    <pubDate>2025-03-06T16:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/daffodil-data-frames-for-optimized-data-inspection-and-logical-processing" />
    <id>https://python.code-maven.com/daffodil-data-frames-for-optimized-data-inspection-and-logical-processing</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/L9OJtuJVOYg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>Speaker: <a href="https://www.linkedin.com/in/raylutz/">Ray Lutz</a></p>
<p><img src="images/ray-lutz.jpeg" alt="Ray Lutz" /></p>
<p><a href="https://github.com/raylutz/daffodil">daffodil</a> (data frames for optimized data inspection and logical (processing)), which can create data frame instances similar to pandas, but using conventional python data types.</p>
<p>This means no conversion to/from the Pandas world, which I have found
from testing has a very high overhead. In fact, unless you plan to do at
least 30 repetitive column-based operations (like sums, etc) then you
should just stay in python world and avoid the conversion time, and you
win. But for many, time is not of the essence, or they stay in Pandas
world and never need any python. The syntax is easy to use and I am
extending it to use SQL database to allow for large table size and use
of the robust joins, etc. The SQL part is under work and not released yet.</p>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.370 --&gt; 00:00:06.679
Gabor Szabo: Hello and welcome to the Code-Maven meetup group</p>
<p>2
00:00:06.860 --&gt; 00:00:12.580
Gabor Szabo: and Youtube Channel. If you are watching this on Youtube, thank you very much for everyone who joined us.</p>
<p>3
00:00:13.080 --&gt; 00:00:17.649
Gabor Szabo: and especially thanks Ray, for giving this talk.</p>
<p>4
00:00:17.790 --&gt; 00:00:26.829
Gabor Szabo: My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce these languages, or introduce testing in these languages.</p>
<p>5
00:00:27.030 --&gt; 00:00:33.439
Gabor Szabo: And I also organize these meetings because I think it's very important to share knowledge and</p>
<p>6
00:00:33.640 --&gt; 00:00:38.700
Gabor Szabo: these Zoom meetings and online events allow us to</p>
<p>7
00:00:39.040 --&gt; 00:00:47.660
Gabor Szabo: learn from each other, even if we are halfway around the world. And so, with that, let me</p>
<p>8
00:00:48.120 --&gt; 00:00:52.799
Gabor Szabo: give the word to you, Ray, and please introduce yourself and and just go ahead.</p>
<p>9
00:00:53.030 --&gt; 00:01:15.399
Gabor Szabo: One thing, sorry, just one thing: those who are here, feel free to ask questions, either in the chat or just speak up; Ray will tell you how it's going to work. Just remember, we're recording this, and it's going to be on YouTube. So if you don't want to be on YouTube, then just write.</p>
<p>10
00:01:15.570 --&gt; 00:01:17.069
Gabor Szabo: So thank you, it's yours.</p>
<p>11
00:01:17.660 --&gt; 00:01:25.920
Ray Lutz: Okay, thank you so much, Gabor. Yes, my name is Ray Lutz. Let me share my screen here so we can get started.</p>
<p>12
00:01:27.660 --&gt; 00:01:36.300
Ray Lutz: I am actually not that long-term of a Python user, you know, only about maybe 5, 6 years.</p>
<p>13
00:01:36.925 --&gt; 00:01:43.150
Ray Lutz: And then I had quite a wealth of experience before that with other languages, including.</p>
<p>14
00:01:43.320 --&gt; 00:01:58.208
Ray Lutz: you know, assembly language, C, you know, Perl, JavaScript, all these other kinds of languages in one form or another. Even so, I do really like Python, so I did kind of settle on that</p>
<p>15
00:01:59.190 --&gt; 00:02:00.310
Ray Lutz: for now.</p>
<p>16
00:02:00.670 --&gt; 00:02:08.970
Ray Lutz: And so essentially, today, we're going to talk about this package called Daffodil.</p>
<p>17
00:02:09.110 --&gt; 00:02:19.119
Ray Lutz: And it is "data frames for optimized data inspection and logical processing". I came up with that later, you know, after we chose the name. But</p>
<p>18
00:02:19.290 --&gt; 00:02:26.149
Ray Lutz: the idea is that you see df a lot; if you use pandas, you're talking about data frames, df, and</p>
<p>19
00:02:26.300 --&gt; 00:02:35.600
Ray Lutz: so we wanted something kind of like that, and we use daf. So, you know, throughout the code, if you see daf, you know that it's a Daffodil data frame</p>
<p>20
00:02:35.710 --&gt; 00:02:37.769
Ray Lutz: instead of a pandas.</p>
<p>21
00:02:39.390 --&gt; 00:02:43.949
Ray Lutz: And I have a Master's degree, mostly electronics. I did do</p>
<p>22
00:02:44.810 --&gt; 00:02:51.010
Ray Lutz: various medical devices and and document processing in my career.</p>
<p>23
00:02:52.170 --&gt; 00:03:05.290
Ray Lutz: Most recently I'm developing audit engine, which is a ballot image auditing platform for checking elections. And underneath the citizens oversight, which is a nonprofit organization.</p>
<p>24
00:03:05.940 --&gt; 00:03:11.629
Ray Lutz: Now, why, Daffodil, we already have pandas. So why would we need something new?</p>
<p>25
00:03:11.760 --&gt; 00:03:20.499
Ray Lutz: Well, I needed a two-dimensional data type sort of a table structure. And so I started using pandas</p>
<p>26
00:03:21.433 --&gt; 00:03:26.579
Ray Lutz: for almost everything I I use. You know, these 2 dimensional tables are really handy.</p>
<p>27
00:03:26.990 --&gt; 00:03:31.890
Ray Lutz: but it turns out that pandas is mostly designed for numerics and</p>
<p>28
00:03:33.630 --&gt; 00:03:35.880
Ray Lutz: it uses numpy under the hood.</p>
<p>29
00:03:37.400 --&gt; 00:03:46.650
Ray Lutz: and so it's slow, really slow, for row-based operations, and some of them are now not even allowed. So you can't do an append</p>
<p>30
00:03:46.920 --&gt; 00:03:51.099
Ray Lutz: of, like, a pandas row. Seems like a basic thing you might want to do,</p>
<p>31
00:03:51.290 --&gt; 00:03:58.989
Ray Lutz: that's now not supported at all in pandas, because they know it's so desperately slow, a disaster.</p>
<p>32
00:03:59.756 --&gt; 00:04:04.090
Ray Lutz: So then you have to go over and and use something else. If you want to do that sort of thing</p>
<p>33
00:04:05.070 --&gt; 00:04:06.400
Ray Lutz: and</p>
<p>34
00:04:07.280 --&gt; 00:04:16.359
Ray Lutz: and also apply. They say don't use apply, and apply is kind of a handy thing, which means you go row by row and you apply some function to it</p>
<p>35
00:04:16.760 --&gt; 00:04:18.329
Ray Lutz: at each row.</p>
<p>36
00:04:18.519 --&gt; 00:04:22.530
Ray Lutz: And so you can't do that either, they said. We're deprecating all these things.</p>
<p>37
00:04:22.740 --&gt; 00:04:27.689
Ray Lutz: I think you can still do apply. But they say, you know, it's really not recommended at all.</p>
<p>38
00:04:28.470 --&gt; 00:04:29.809
Ray Lutz: And then</p>
<p>39
00:04:31.010 --&gt; 00:04:39.919
Ray Lutz: it turns out also, when we're using files that are kind of a weird formats. Pandas assumes a lot when it reads them in.</p>
<p>40
00:04:40.090 --&gt; 00:04:46.950
Ray Lutz: and you have to go jump through a lot of hoops to get it to just read it in like like something without doing anything.</p>
<p>41
00:04:47.090 --&gt; 00:04:49.320
Ray Lutz: and then convert things as you go.</p>
<p>42
00:04:50.075 --&gt; 00:04:53.209
Ray Lutz: It has some other problems, too, and we'll get into that.</p>
<p>43
00:04:53.360 --&gt; 00:05:14.250
Ray Lutz: So this is when I started looking for another data type, and I had some various ones that I started using. And I ended up standardizing on this type of a two-dimensional data frame, which is based on a list of lists. I call it a lol; it doesn't mean laughing out loud, it's a list-of-lists type.</p>
<p>44
00:05:14.800 --&gt; 00:05:17.030
Ray Lutz: And so it's a</p>
<p>45
00:05:17.430 --&gt; 00:05:27.109
Ray Lutz: it's a Python list, and in each of these lists you have an additional list, and it's rectangular in form.</p>
<p>46
00:05:27.360 --&gt; 00:05:33.910
Ray Lutz: So every single row is the same length, and it has a certain thing. So it's it's a rectangular</p>
<p>47
00:05:34.130 --&gt; 00:05:39.620
Ray Lutz: array, but it's not the array type. It's a list of lists. So it's easy to add to.</p>
<p>48
00:05:39.780 --&gt; 00:05:53.780
Ray Lutz: relatively easy to splice and insert rows or columns. You can do a lot of things fairly easily; mostly, inserting rows is easy, columns not quite so easy. But</p>
<p>49
00:05:55.260 --&gt; 00:05:58.310
Ray Lutz: it's fairly malleable. And then</p>
<p>50
00:05:58.450 --&gt; 00:06:07.459
Ray Lutz: also you can put anything at all in any one of these cells, and python will handle it just fine, so you could put a whole pandas array in here. If you want.</p>
<p>51
00:06:07.910 --&gt; 00:06:13.879
Ray Lutz: you could put a whole numpy array of a million things in one cell if you want. Okay, so that's</p>
<p>52
00:06:14.000 --&gt; 00:06:15.800
Ray Lutz: it's very versatile that way.</p>
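<p>The list-of-lists ("lol") core described here is just plain Python, which a short sketch makes concrete:</p>

```python
# A "lol" (list of lists): each inner list is a row, all the same length,
# so the structure is rectangular like a table.
lol = [
    ["apple", 3, 0.5],
    ["pear",  7, 0.8],
    ["plum",  2, 0.3],
]

lol.append(["fig", 9, 1.1])   # appending a row is cheap (amortized O(1))
row = lol[1]                  # a whole row by number
cell = lol[1][2]              # one cell: row 1, column 2

# Any cell can hold any Python object, even another whole table:
lol[3][2] = [[1, 2], [3, 4]]
```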
<p>53
00:06:16.280 --&gt; 00:06:27.199
Ray Lutz: So the basic thing is that you have this array, which is just numbered, and the numbers here don't stick to the columns and rows like they do in pandas,</p>
<p>54
00:06:28.680 --&gt; 00:06:36.939
Ray Lutz: or maybe other things. They float, like they would in a regular spreadsheet. So if you move the rows around, the numbers of the rows</p>
<p>55
00:06:37.130 --&gt; 00:06:42.429
Ray Lutz: are going to stay in the same order, even though you might have moved something up there and so forth.</p>
<p>56
00:06:42.870 --&gt; 00:06:47.700
Ray Lutz: But then you can also optionally have names for each column.</p>
<p>57
00:06:47.940 --&gt; 00:06:56.250
Ray Lutz: data types for the columns, and a separate data types object that explains what those are.</p>
<p>58
00:06:56.420 --&gt; 00:07:20.869
Ray Lutz: And then optional row keys. Okay, these are both dictionaries. So the header dictionary, hd, and the row keys dictionary are a special type of dictionary which gives you the number of the column, or the number of the row, in the dictionary. So I don't know what you call this exactly, but I ended up calling it a keyed list.</p>
<p>59
00:07:21.060 --&gt; 00:07:27.140
Ray Lutz: In other words, this is the key,</p>
<p>60
00:07:27.430 --&gt; 00:07:31.920
Ray Lutz: We'll go into the keyed list later. But essentially this is the key,</p>
<p>61
00:07:32.310 --&gt; 00:07:36.460
Ray Lutz: and this is the number that refers to an item in a list.</p>
<p>62
00:07:36.860 --&gt; 00:07:37.940
Ray Lutz: And</p>
<p>63
00:07:39.230 --&gt; 00:07:47.170
Ray Lutz: so your dictionary would have a key, and the value is always 0, 1, 2, 3, 4, and so forth.</p>
<p>64
00:07:47.350 --&gt; 00:08:04.169
Ray Lutz: And there isn't a standard function for this in Python. Like, you can have fromkeys, and you can give it a single value, and it can have Nones all the way, or zeros, whatever you want, all the way through. But it doesn't do this automatically. But it's easy to make.</p>
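<p>The "keyed list" dictionaries described here (the header dictionary and the row keys) are easy to build with <code>enumerate</code>, since <code>dict.fromkeys</code> can only give every key the same value:</p>

```python
# A "keyed list": a dict whose values are just 0, 1, 2, ..., mapping a
# column name (or row key) to its position in each list.
cols = ["name", "qty", "price"]
hd = {name: i for i, name in enumerate(cols)}  # header dict

row = ["pear", 7, 0.8]
qty = row[hd["qty"]]       # look up a cell by column name instead of index

# dict.fromkeys gives every key one shared value, not 0..n-1:
same = dict.fromkeys(cols, 0)   # {'name': 0, 'qty': 0, 'price': 0}
```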
<p>65
00:08:04.410 --&gt; 00:08:06.249
Ray Lutz: So this is what it looks like.</p>
<p>66
00:08:09.330 --&gt; 00:08:20.650
Ray Lutz: Now, as I said, the Row keys and the Header Dictionary are dictionaries, but these are all optional. You could start with nothing, just an array of list of lists, and you still get all the functionality.</p>
<p>67
00:08:20.950 --&gt; 00:08:23.839
Ray Lutz: But you would have to be using these indexes here.</p>
<p>68
00:08:24.280 --&gt; 00:08:25.830
Ray Lutz: All right, let's go on to next.</p>
<p>69
00:08:26.100 --&gt; 00:08:30.579
Ray Lutz: So essentially what my problem was this, if you have.</p>
<p>70
00:08:31.090 --&gt; 00:08:36.630
Ray Lutz: you want to use pandas, and you import pandas here. And you say, I want to start a new data frame.</p>
<p>71
00:08:37.280 --&gt; 00:08:43.970
Ray Lutz: and let's say you go through a bunch of Urls, and you harvest stuff from web pages, and you want to append to this array.</p>
<p>72
00:08:45.200 --&gt; 00:08:46.570
Ray Lutz: If you say</p>
<p>73
00:08:46.690 --&gt; 00:08:55.020
Ray Lutz: my_dataframe.append with the web page metadata. You just take a dictionary, and you want to add it to the bottom of the pandas array.</p>
<p>74
00:08:55.250 --&gt; 00:08:59.530
Ray Lutz: It's horrible! And this, in fact, this has been banned by</p>
<p>75
00:08:59.970 --&gt; 00:09:02.950
Ray Lutz: the Pandas people. You can't append anymore.</p>
<p>76
00:09:03.190 --&gt; 00:09:06.760
Ray Lutz: They they just said, This doesn't exist. That's how bad it is.</p>
<p>77
00:09:06.860 --&gt; 00:09:09.240
Ray Lutz: Now, what were they doing? Why is it so bad.</p>
<p>78
00:09:09.420 --&gt; 00:09:14.370
Ray Lutz: It's because what pandas is is, let me go back a second.</p>
<p>79
00:09:15.120 --&gt; 00:09:18.980
Ray Lutz: I gotta. What is it? Shift to go back control?</p>
<p>80
00:09:20.630 --&gt; 00:09:22.279
Ray Lutz: I gotta go with the keys.</p>
<p>81
00:09:23.050 --&gt; 00:09:31.609
Ray Lutz: Okay, so what pandas is is essentially a numpy array vertically right here in a dictionary</p>
<p>82
00:09:32.010 --&gt; 00:09:36.450
Ray Lutz: where you have the name of the dictionary, and the value</p>
<p>83
00:09:36.650 --&gt; 00:09:41.830
Ray Lutz: is a numpy array vertically, and you've got to think of it that way, and they're all the same length.</p>
<p>84
00:09:42.370 --&gt; 00:09:45.710
Ray Lutz: So the numpy array has data in</p>
<p>85
00:09:48.260 --&gt; 00:09:57.079
Ray Lutz: numpy arrays. The data, each value, is like rammed up against the next. There's nothing else, unlike Python, where</p>
<p>86
00:09:57.270 --&gt; 00:10:06.190
Ray Lutz: even an integer, or whatever you have in here, takes quite a bit of overhead. Usually it'll be like, I think, 28 bytes just to represent an integer. There's a lot of overhead generally,</p>
<p>87
00:10:06.540 --&gt; 00:10:12.820
Ray Lutz: and if you have, if you put a dictionary in each row, then you have the key for each one. That's I'll get into that in a second.</p>
<p>88
00:10:13.010 --&gt; 00:10:19.259
Ray Lutz: My point, though, is that in a pandas array you have the the name, and you have a</p>
<p>89
00:10:20.950 --&gt; 00:10:28.160
Ray Lutz: numpy array, and if you want to add to the bottom, you have to create all new numpy arrays, or add to each one.</p>
<p>90
00:10:28.480 --&gt; 00:10:32.420
Ray Lutz: They don't let you just add to each one; they create a whole new array every time,</p>
<p>91
00:10:32.770 --&gt; 00:10:37.929
Ray Lutz: so they copy it over and add to the bottom, copy it over, add to the bottom, copy it over it. That's how they do it.</p>
<p>92
00:10:38.240 --&gt; 00:10:39.739
Ray Lutz: And so it takes a long time</p>
<p>93
00:10:41.630 --&gt; 00:10:51.909
Ray Lutz: if you're appending. So they basically have disallowed this. So if you're not going to do that, then you can do this: you can say, I want to make a list of dictionaries. I call it a lod.</p>
<p>94
00:10:52.460 --&gt; 00:10:56.850
Ray Lutz: Okay? And it's a list of dictionaries with string keys and anything inside.</p>
<p>95
00:10:57.230 --&gt; 00:11:05.030
Ray Lutz: And then you read the web page and you put your metadata dict and you append to the list of dictionaries. This will work fine.</p>
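<p>A minimal sketch of that accumulation pattern, in plain Python (the <code>scrape_page</code> helper is a hypothetical stand-in for reading a web page's metadata):</p>

```python
from typing import Any

# Accumulate rows as a "lod": a list of dictionaries with string keys.
# Appending to a plain Python list is cheap (amortized O(1)), unlike
# appending rows to a pandas DataFrame, which copies the whole frame.
lod: list[dict[str, Any]] = []

def scrape_page(n: int) -> dict[str, Any]:
    # Hypothetical stand-in for reading one web page's metadata.
    return {"page": n, "title": f"Page {n}", "links": n * 3}

for n in range(5):
    lod.append(scrape_page(n))

print(len(lod), lod[0])
```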
<p>96
00:11:06.100 --&gt; 00:11:06.680
Gabor Szabo: Be fast</p>
<p>97
00:11:06.840 --&gt; 00:11:15.420
Gabor Szabo: sorry, let me just say something related to this. It's interesting, because I think in both Go and Rust</p>
<p>98
00:11:16.160 --&gt; 00:11:21.789
Gabor Szabo: you can allocate more space for these arrays,</p>
<p>99
00:11:22.090 --&gt; 00:11:47.780
Gabor Szabo: even if you don't use them. So you can say, okay, at the end I'm going to have a hundred or 1,000 of these vectors or arrays; right now I have one item in there. The memory is already allocated, so you can append up to 1,000 without this overhead of recreating the whole array.</p>
<p>100
00:11:48.210 --&gt; 00:11:58.869
Ray Lutz: They could have done a better job in pandas, because they would not need to copy over the whole thing. I didn't even know they were doing that when I first started.</p>
<p>101
00:11:59.030 --&gt; 00:12:07.589
Ray Lutz: And so I noticed, when the array started to get pretty big, that it just started to slow down to a snail's pace. And so, what is this? Well,</p>
<p>102
00:12:07.710 --&gt; 00:12:19.159
Ray Lutz: and then, in the documentation it says, Don't do this. What you're going to have to do is create something else, a list of dictionaries, and then at one fell swoop take your list of dictionaries and convert it into a data frame.</p>
<p>103
00:12:19.460 --&gt; 00:12:21.690
Ray Lutz: and then it'll be reasonably fast.</p>
<p>104
00:12:22.100 --&gt; 00:12:24.400
Ray Lutz: But this turns out, is very slow.</p>
<p>105
00:12:26.672 --&gt; 00:12:33.759
Ray Lutz: But it's way faster than the appending. Okay, so if you're going through and appending to the bottom of the array,</p>
<p>106
00:12:35.960 --&gt; 00:12:42.792
Ray Lutz: this will be faster. But then this part right here is actually kind of slow. But if that's all you're gonna do, and you're just gonna write it out to a</p>
<p>107
00:12:43.140 --&gt; 00:12:44.440
Ray Lutz: CSV file,</p>
<p>108
00:12:44.650 --&gt; 00:12:50.110
Ray Lutz: then you've just wasted a lot of time, because you didn't need to go through this here;</p>
<p>109
00:12:50.700 --&gt; 00:13:01.689
Ray Lutz: you could just write it straight out. But if you did do a couple of things with it before you did that, you know, maybe you summed everything one time, and you added everything up,</p>
<p>110
00:13:02.681 --&gt; 00:13:08.649
Ray Lutz: maybe you did some other manipulation. You thought being in pandas world was a good idea.</p>
<p>111
00:13:09.272 --&gt; 00:13:11.810
Ray Lutz: But then you had this overhead of doing this.</p>
<p>112
00:13:12.020 --&gt; 00:13:16.770
Ray Lutz: So this works. But it turns out this is very slow, and when you time it.</p>
<p>113
00:13:17.350 --&gt; 00:13:29.710
Ray Lutz: going from a list of dictionaries into pandas with, this is a 1 million integer table, a thousand by a thousand. Okay, that's the size table that we're using for our benchmark.</p>
<p>114
00:13:30.230 --&gt; 00:13:40.079
Ray Lutz: Now, would pandas normally have a thousand columns? No, right? Because most pandas data tables have very few columns.</p>
<p>115
00:13:40.280 --&gt; 00:13:43.350
Ray Lutz: Usually, 20 to 30 columns is a big one.</p>
<p>116
00:13:44.218 --&gt; 00:13:49.599
Ray Lutz: For the data tables I'm working with. They have a lot of columns. Okay, like</p>
<p>117
00:13:49.740 --&gt; 00:13:57.470
Ray Lutz: something with 5,000 columns is pretty big, but you'll see stuff under that, and a lot of it 300 to 400 columns.</p>
<p>118
00:13:57.570 --&gt; 00:14:00.429
Ray Lutz: So a thousand by a thousand is not unusual, in what I see,</p>
<p>119
00:14:00.850 --&gt; 00:14:12.250
Ray Lutz: and when you convert this in Daffodil, you take the list of dictionaries and make a list of lists, formatted for daffodil, and it takes 139.</p>
<p>120
00:14:13.810 --&gt; 00:14:15.009
Ray Lutz: What is it?</p>
<p>121
00:14:15.770 --&gt; 00:14:22.660
Ray Lutz: Microseconds? Milliseconds, I believe. Pandas takes 5,600, more than 5 seconds,</p>
<p>122
00:14:23.640 --&gt; 00:14:25.830
Ray Lutz: more than 5 seconds to convert</p>
<p>123
00:14:25.970 --&gt; 00:14:31.070
Ray Lutz: it into pandas. So it is a ridiculous bottleneck.</p>
<p>124
00:14:31.830 --&gt; 00:14:32.700
Ray Lutz: Okay.</p>
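<p>A rough sketch of the shape of that conversion, the "one fell swoop" from a list of dicts into a header-dict-plus-list-of-lists layout. The sizes and any timing you see are illustrative, not his benchmark numbers, and the layout names (<code>hd</code>, <code>lol</code>) are just this sketch's:</p>

```python
import time

# Build a table as a list of dicts, then convert it in one pass to a
# list of lists with a single shared header dict (daffodil-style rows
# carry no keys). Scaled down from the talk's 1000x1000 benchmark.
ROWS, COLS = 300, 300
cols = [f"c{i}" for i in range(COLS)]
lod = [{c: r * COLS + i for i, c in enumerate(cols)} for r in range(ROWS)]

t0 = time.perf_counter()
hd = {c: i for i, c in enumerate(cols)}        # header dict, built once
lol = [[row[c] for c in cols] for row in lod]  # rows as plain lists
elapsed = time.perf_counter() - t0
print(f"converted {ROWS}x{COLS} in {elapsed * 1000:.1f} ms")
```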
<p>125
00:14:33.350 --&gt; 00:14:50.620
Ray Lutz: it takes 139. Like, look at the difference here, and if you multiply this out, even though pandas is really, really fast at doing certain things, like summing columns, which is ridiculously fast compared to Daffodil. I can sum columns here at 191 ms.</p>
<p>126
00:14:50.720 --&gt; 00:14:52.359
Ray Lutz: It takes only 4.</p>
<p>127
00:14:52.810 --&gt; 00:14:56.289
Ray Lutz: So that's a big difference. So you do a big savings here.</p>
<p>128
00:14:56.430 --&gt; 00:15:09.510
Ray Lutz: If you do a lot of these, then this might make up for this big difference here, but it takes a lot. It takes at least 30: doing all columns, summing all columns, standard deviation. You've got to do 30 of those</p>
<p>129
00:15:10.120 --&gt; 00:15:12.950
Ray Lutz: before you make up for converting it into pandas.</p>
<p>130
00:15:14.670 --&gt; 00:15:30.780
Ray Lutz: So for just a few things like summing columns or something, or just manipulating the data a little bit. You're just better off not getting into pandas because of this ridiculous conversion factor. Now, I tried to get around this problem here.</p>
<p>131
00:15:31.560 --&gt; 00:15:34.549
Ray Lutz: and there's also another problem with pandas.</p>
<p>132
00:15:34.760 --&gt; 00:15:38.289
Ray Lutz: This is with integers; as soon as you add a string,</p>
<p>133
00:15:39.620 --&gt; 00:15:49.650
Ray Lutz: and the size here is 38 MB, I believe it's megabytes, for</p>
<p>134
00:15:49.750 --&gt; 00:15:53.450
Ray Lutz: a million integers, and</p>
<p>135
00:15:55.600 --&gt; 00:16:05.859
Ray Lutz: they're only at 9.3, so it's quite a bit more compact in pandas, right, if you have just integers or just floats.</p>
<p>136
00:16:06.190 --&gt; 00:16:12.950
Ray Lutz: But if you get a string, then this goes up and becomes quite a bit larger, by 10 times,</p>
<p>137
00:16:13.730 --&gt; 00:16:18.089
Ray Lutz: and quite a bit larger than a daffodil table, which really doesn't go up very much.</p>
<p>138
00:16:19.250 --&gt; 00:16:26.170
Ray Lutz: Okay. So then, you know, numpy, we can convert things to numpy really quickly.</p>
<p>139
00:16:26.440 --&gt; 00:16:33.029
Ray Lutz: 48 ms going to numpy, and from numpy doesn't take very long.</p>
<p>140
00:16:33.290 --&gt; 00:16:36.139
Ray Lutz: and then you can manipulate in numpy</p>
<p>141
00:16:36.630 --&gt; 00:16:47.929
Ray Lutz: one column at a time, or add columns together, or sum the columns, whatever you want to do. You can then do it directly in numpy and skip over pandas. Pandas is also a big beast.</p>
<p>142
00:16:48.170 --&gt; 00:16:52.400
Ray Lutz: It takes a long time to load, so if you use daffodil.</p>
<p>143
00:16:52.890 --&gt; 00:17:13.309
Ray Lutz: you import daffodil, and then you create a daffodil array. You can't click on this, or it goes to the next thing, I can't highlight for that reason. But you create a daffodil array, my_daf, and then I go through the URL, and I get this stuff, and I append the dictionary to the Daffodil array,</p>
<p>144
00:17:13.369 --&gt; 00:17:25.439
Ray Lutz: done, and then I simply write it out directly, and I skip over this thing here. Now, I was in a bad habit of using these pandas arrays for almost everything, because they're so handy.</p>
<p>145
00:17:25.760 --&gt; 00:17:37.760
Ray Lutz: But little did I know that my code was getting to be really slow, because the conversion of the list of dictionaries over to pandas was taking a long time every single time, and then back.</p>
<p>146
00:17:40.047 --&gt; 00:17:45.620
Ray Lutz: So this is when I came up with daffodil, and you know, what it provides is</p>
<p>147
00:17:46.060 --&gt; 00:17:51.780
Ray Lutz: a way of also indexing into these. Now, if you just used a list of dictionaries.</p>
<p>148
00:17:52.210 --&gt; 00:18:06.820
Ray Lutz: if you think about it, for every single row the keys are repeated, and then the next row, you repeat the keys, and you repeat the keys, repeat the keys. So every single row has a lot of overhead, because the keys are being repeated.</p>
<p>149
00:18:09.260 --&gt; 00:18:22.100
Ray Lutz: So when you crunch that down, you know, if we look back at the data type here, you see in the row here that for each row you just have a list of values,</p>
<p>150
00:18:22.260 --&gt; 00:18:30.510
Ray Lutz: and you don't have the keys. The keys are one time, only you don't need them every single row. So you crunch all the keys up into one row.</p>
<p>151
00:18:31.090 --&gt; 00:18:40.619
Ray Lutz: and then the indexing goes 2 times. So first you get the index of the list, and then you index into the list to get the data item. So it's one more step to get to it.</p>
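<p>The layout he's describing, keys stored once in a header dictionary and each row as a bare list, can be sketched in plain Python (the names <code>hd</code>, <code>rows</code>, and the sample data are this sketch's, not daffodil's actual internals):</p>

```python
# Keys stored one time only, mapping column name -> position.
hd = {"name": 0, "age": 1, "city": 2}
rows = [
    ["Ann", 34, "Haifa"],
    ["Ben", 45, "Austin"],
]

# Indexing takes two steps: pick the row list, then index into it.
age_of_ben = rows[1][hd["age"]]
print(age_of_ben)  # 45

# Compare with a list of dicts, where every row repeats all the keys.
lod = [dict(zip(hd, row)) for row in rows]
print(lod[1]["age"])  # 45
```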
<p>152
00:18:42.290 --&gt; 00:18:48.650
Ray Lutz: But these are lists, and there's a lot of benefits to that</p>
<p>153
00:18:51.000 --&gt; 00:18:59.080
Ray Lutz: First of all, we can use this type of indexing in python, which they provide all of this as part of their infrastructure, so that you can write code</p>
<p>154
00:18:59.240 --&gt; 00:19:02.600
Ray Lutz: that uses row, column indexing,</p>
<p>155
00:19:03.040 --&gt; 00:19:07.930
Ray Lutz: or you can use it for anything. But in this case the 1st index is row and then column</p>
<p>156
00:19:10.398 --&gt; 00:19:16.771
Ray Lutz: so the row and column can be integers, and that can be either the array index</p>
<p>157
00:19:17.550 --&gt; 00:19:18.500
Ray Lutz: or</p>
<p>158
00:19:20.240 --&gt; 00:19:28.510
Ray Lutz: it can be, if you want it to be. But if it's an integer, it assumes it's going to be the array index and not</p>
<p>159
00:19:28.790 --&gt; 00:19:31.040
Ray Lutz: going through the dictionaries.</p>
<p>160
00:19:31.720 --&gt; 00:19:37.109
Ray Lutz: If you want to go through the dictionary instead, you have to use a method. But</p>
<p>161
00:19:38.641 --&gt; 00:19:43.559
Ray Lutz: if it's a string, then it assumes that it's a key into the dictionaries.</p>
<p>162
00:19:44.120 --&gt; 00:19:57.309
Ray Lutz: and it can be a list of integers which, then, is the list of array indices that you want to choose. It can be a list of strings. It can be a list of string keys, so you can pull out individual rows, individual columns.</p>
<p>163
00:19:57.570 --&gt; 00:20:03.520
Ray Lutz: Whatever you want, you can index, an individual position, and the array</p>
<p>164
00:20:03.790 --&gt; 00:20:08.950
Ray Lutz: you can slice and dice it you can give it a</p>
<p>165
00:20:09.200 --&gt; 00:20:14.760
Ray Lutz: a range of indexes like 5 to 10, which gives you 5, 6, 7, 8, 9,</p>
<p>166
00:20:15.510 --&gt; 00:20:20.010
Ray Lutz: or you can do a range of keys, in a closed</p>
<p>167
00:20:20.340 --&gt; 00:20:28.939
Ray Lutz: kind, and we use a tuple for that. So it's like from c to ab, so from, like, column c to column ab,</p>
<p>168
00:20:29.140 --&gt; 00:20:36.129
Ray Lutz: because you don't know what's after ab; you can't say go to the next one and back up one. You have to give it a closed range,</p>
<p>169
00:20:36.580 --&gt; 00:20:38.570
Ray Lutz: and so we do it like that.</p>
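<p>The closed range of keys can be sketched by resolving both endpoint names through the header dict and including both ends (the helper <code>key_range</code> and the column names are this sketch's invention, not daffodil's API):</p>

```python
# Resolving a closed range of column names, ('c', 'ab') style: since you
# can't write "one past the last key", the range includes both endpoints.
hd = {"a": 0, "b": 1, "c": 2, "aa": 3, "ab": 4, "ac": 5}

def key_range(hd: dict[str, int], first: str, last: str) -> list[int]:
    """Return the column indexes from `first` to `last`, inclusive."""
    return list(range(hd[first], hd[last] + 1))

print(key_range(hd, "c", "ab"))  # [2, 3, 4] -> columns c, aa, ab
```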
<p>170
00:20:39.070 --&gt; 00:20:44.780
Ray Lutz: Now you can leave out the column if you want to use all columns, kind of like star in a SQL expression.</p>
<p>171
00:20:45.670 --&gt; 00:20:49.729
Ray Lutz: But you can just leave that out and then talk about the row,</p>
<p>172
00:20:50.160 --&gt; 00:21:03.749
Ray Lutz: and you can index in. If you append, the things can be in a different order, and they will always go in correctly. So here I have it screwed up, where C is first, and it ends up putting C in the right place.</p>
<p>173
00:21:03.910 --&gt; 00:21:07.139
Ray Lutz: And there's all kinds of examples here of how you would.</p>
<p>174
00:21:07.728 --&gt; 00:21:28.079
Ray Lutz: Take that array that we start with here, 1, 2, 3, and 0, 4, 5, 6, 7, 8, 9, 10. I don't know why I did that. But then you end up saying, I want to use rows 0 and 1, because that's a slice, and you get those. You can get the columns the same way; you can use the names of the columns,</p>
<p>175
00:21:28.200 --&gt; 00:21:36.639
Ray Lutz: take all rows and names of columns here in a list, and so forth. We also offer</p>
<p>176
00:21:36.830 --&gt; 00:21:42.830
Ray Lutz: a list of tuples, I'm sorry, a list of ranges, which is kind of handy sometimes.</p>
<p>177
00:21:43.690 --&gt; 00:21:45.829
Ray Lutz: and then you can</p>
<p>178
00:21:46.100 --&gt; 00:21:53.070
Ray Lutz: get, you can set a value. You can say, I want to set this value to the entire array, and it sets the whole thing.</p>
<p>179
00:21:53.270 --&gt; 00:22:02.466
Ray Lutz: You can set the first few columns, you can slice it and do this. So this is setting, so you can set,</p>
<p>180
00:22:02.940 --&gt; 00:22:11.539
Ray Lutz: and you can also pop in a list like, if you have a list, you want to put that in the column, you put that in and put a list into the row.</p>
<p>181
00:22:12.385 --&gt; 00:22:17.500
Ray Lutz: You can put another daffodil array in, and it will put in this, that.</p>
<p>182
00:22:18.330 --&gt; 00:22:24.870
Ray Lutz: for, you know, whatever the rectangular region is, it can put that in. All those things work.</p>
<p>183
00:22:26.600 --&gt; 00:22:33.980
Ray Lutz: Now, there's a return mode, which is optional. But we're gonna end up putting this in, like, when you do this</p>
<p>184
00:22:35.236 --&gt; 00:22:36.990
Ray Lutz: indexing here.</p>
<p>185
00:22:37.850 --&gt; 00:22:50.590
Ray Lutz: you want to get the value out in this case, because you want to multiply 2 values together. If you set the return mode to 'val', then it'll give you the value directly. If you just did this, you would get a daffodil array</p>
<p>186
00:22:51.560 --&gt; 00:22:53.760
Ray Lutz: of the cell 0 1 1.</p>
<p>187
00:22:54.090 --&gt; 00:23:01.039
Ray Lutz: I'm sorry, 1 comma 0. So it would be row 1; this is row 0, row 1, and this would be 5, right?</p>
<p>188
00:23:01.350 --&gt; 00:23:26.629
Ray Lutz: And you would get an array of 5, one thing in the middle of the array. Well, you don't want that; you just wanted the value. So if you said return the value, then you can just multiply it by this value over here, and the one at 0, 1, 2 is 10. Multiply those together, 5 times 10, and put that in the cell 2 comma 1, and down here</p>
<p>189
00:23:26.890 --&gt; 00:23:29.180
Ray Lutz: 0, 1, 2, 1 is 50.</p>
<p>190
00:23:29.430 --&gt; 00:23:33.750
Ray Lutz: So we multiplied those values together and put it in here. So it's all malleable.</p>
<p>191
00:23:33.890 --&gt; 00:23:38.526
Ray Lutz: You can do it like that just like a spreadsheet, and then</p>
<p>192
00:23:39.810 --&gt; 00:23:43.939
Ray Lutz: we can insert columns here. So we're going to put one in first. So the</p>
<p>193
00:23:44.160 --&gt; 00:23:50.149
Ray Lutz: if you add a column like house, car, and boat, and we call that category,</p>
<p>194
00:23:52.740 --&gt; 00:23:59.119
Ray Lutz: then we also say we want to set the key field to category. Now, what it's done is</p>
<p>195
00:23:59.310 --&gt; 00:24:08.039
Ray Lutz: What it does is: one of the columns, you can say that's going to be my key field, and then it puts it in that dictionary lookup</p>
<p>196
00:24:08.200 --&gt; 00:24:09.600
Ray Lutz: Called the</p>
<p>197
00:24:11.840 --&gt; 00:24:20.049
Ray Lutz: dk? Let's see... it's called the kd, it's called the key dictionary. So this is a dictionary lookup, so super fast</p>
<p>198
00:24:20.260 --&gt; 00:24:21.729
Ray Lutz: if you have a long one.</p>
<p>199
00:24:22.600 --&gt; 00:24:30.250
Ray Lutz: but it has to be... If you do this, you can't have repeated values in here. It's gonna hit the first one that it sees.</p>
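<p>The key-field idea can be sketched in plain Python: pick one column as the key and build a dictionary from its values to row numbers for O(1) lookups. The names (<code>hd</code>, <code>kd</code>) and the duplicate-handling shown are this sketch's, not necessarily daffodil's exact behavior:</p>

```python
# One column is designated the key field; kd maps its values to row
# positions, so lookup by key is a single dict access.
hd = {"category": 0, "value": 1}
rows = [["house", 100], ["car", 25], ["boat", 17]]

kd = {row[hd["category"]]: i for i, row in enumerate(rows)}

print(rows[kd["boat"]])  # ['boat', 17] -- a dict lookup, so super fast

# Keys must be unique: a repeated category would collide in the dict
# (this particular comprehension keeps the last occurrence it sees).
```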
<p>200
00:24:31.774 --&gt; 00:24:37.559
Ray Lutz: And so here, what we did was we added additional records, and it's going to add them in there</p>
<p>201
00:24:37.760 --&gt; 00:24:38.690
Ray Lutz: with</p>
<p>202
00:24:40.260 --&gt; 00:24:45.879
Ray Lutz: Here, you see the category is in a different order, and it still puts it in</p>
<p>203
00:24:46.530 --&gt; 00:24:54.150
Ray Lutz: and if we have a double in there, it's going to modify the one that's there.</p>
<p>204
00:24:54.560 --&gt; 00:24:57.849
Ray Lutz: So if you index in and you say</p>
<p>205
00:24:58.403 --&gt; 00:25:02.540
Ray Lutz: house, car, boat, and then house, car, boat, mall, van, condo.</p>
<p>206
00:25:02.680 --&gt; 00:25:04.559
Ray Lutz: I think I have it in the next one.</p>
<p>207
00:25:05.330 --&gt; 00:25:15.550
Ray Lutz: where, if you say house and you give it new values. It's going to modify the one that's there. Okay, so it doesn't add another one called house.</p>
<p>208
00:25:16.650 --&gt; 00:25:23.520
Ray Lutz: Then you can select by using a select where statement.</p>
<p>209
00:25:23.980 --&gt; 00:25:34.650
Ray Lutz: This is where lambda statements are really useful, where you just say lambda row, and you say the rows where the C value is greater than 20: I want to select those rows.</p>
<p>210
00:25:35.270 --&gt; 00:25:42.259
Ray Lutz: It makes a new daffodil table, but it doesn't make new rows.</p>
<p>211
00:25:42.640 --&gt; 00:25:47.410
Ray Lutz: These are actually the rows from this table just referenced over here.</p>
<p>212
00:25:48.110 --&gt; 00:25:53.340
Ray Lutz: So it uses by-reference, just like Python does all the time.</p>
<p>213
00:25:53.460 --&gt; 00:26:04.570
Ray Lutz: So you're not actually creating a whole new table. These are not unique values; these are actually the same list values from over here, put in over here, so that you've just selected them.</p>
<p>214
00:26:04.770 --&gt; 00:26:10.030
Ray Lutz: And so this daffodil table only has a list of references to the same data.</p>
<p>215
00:26:10.310 --&gt; 00:26:15.149
Ray Lutz: All right, so this way these selections are very fast, because it doesn't do any copying,</p>
<p>216
00:26:15.270 --&gt; 00:26:17.197
Ray Lutz: unless you wanted to.</p>
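<p>The by-reference selection he's describing is easy to demonstrate over the header-dict/list-of-lists layout (a plain-Python sketch, not daffodil's actual <code>select_where</code>):</p>

```python
# The filtered "table" holds references to the very same row lists, so
# no row data is copied by the selection.
hd = {"a": 0, "b": 1, "c": 2}
rows = [[1, 2, 30], [4, 5, 6], [7, 8, 21]]

selected = [row for row in rows if row[hd["c"]] > 20]

print(selected)                # [[1, 2, 30], [7, 8, 21]]
print(selected[0] is rows[0])  # True -- same list object, not a copy
```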
<p>217
00:26:17.820 --&gt; 00:26:18.740
Ray Lutz: Okay,</p>
<p>218
00:26:20.300 --&gt; 00:26:30.059
Ray Lutz: you can select a record by the key. You can also just do it this way: put the key into the indexing, and then say you want it to be a dictionary.</p>
<p>219
00:26:30.380 --&gt; 00:26:35.700
Ray Lutz: Now, what we're going to end up doing is putting a comma in here. Whoops, can't click.</p>
<p>220
00:26:35.850 --&gt; 00:26:44.570
Ray Lutz: You put a comma in here and put a return type, rtype equals dict, inside here, instead of having .to_dict,</p>
<p>221
00:26:44.730 --&gt; 00:26:51.919
Ray Lutz: because it's handier to know, like, in this mode, if you want it to be a list,</p>
<p>222
00:26:52.340 --&gt; 00:26:57.040
Ray Lutz: that's what is already in the array: the list. So if you want the list out,</p>
<p>223
00:26:57.220 --&gt; 00:26:59.189
Ray Lutz: You don't want to convert it to</p>
<p>224
00:27:00.940 --&gt; 00:27:09.149
Ray Lutz: Say a dictionary or a whole array, because you're going to get a whole array out of this selection. One row. But it's going to be a daffodil array data type.</p>
<p>225
00:27:11.580 --&gt; 00:27:21.490
Ray Lutz: So if you want to get a list out of it, it's nice to know ahead of time. And we'll show you that in a second, because there's another thing I want to show you, which is called a keyed list,</p>
<p>226
00:27:23.220 --&gt; 00:27:30.630
Ray Lutz: so you can get different types out. If you have to_dict, to_list, to_value, you can just print, or there's other things: to_numpy,</p>
<p>227
00:27:31.125 --&gt; 00:27:34.909
Ray Lutz: to_pandas. You know, there's other things you can convert to here.</p>
<p>228
00:27:36.590 --&gt; 00:27:39.989
Ray Lutz: So a common usage pattern is to process things by row.</p>
<p>229
00:27:40.752 --&gt; 00:27:46.529
Ray Lutz: where somehow you're transforming the original row into a new row,</p>
<p>230
00:27:47.320 --&gt; 00:27:50.870
Ray Lutz: and then you append the new row to the new daffodil table.</p>
<p>231
00:27:51.000 --&gt; 00:28:09.870
Ray Lutz: Now, depending upon what the transform does, it might give you the same data again with just something modified. It might mutate that row, and you would get it back here. When you append this, it's the same row as the original, with a mutation. Guess what? That's going to modify the old row, so you don't necessarily want to do that if you're making a mutation,</p>
<p>232
00:28:13.400 --&gt; 00:28:19.369
Ray Lutz: and then you would append it to the new table.</p>
<p>233
00:28:19.590 --&gt; 00:28:27.799
Ray Lutz: and then you can put it out. It turns out you don't have to flatten, we discovered later, and I want to show you that in a second; it automatically flattens.</p>
<p>234
00:28:29.580 --&gt; 00:28:37.630
Ray Lutz: So you can just apply. If you have a transform-row function, you just say apply the function, and it applies it row by row,</p>
<p>235
00:28:37.780 --&gt; 00:28:41.289
Ray Lutz: and then gives you a new daffodil table. So you just go. You can do it this way.</p>
<p>236
00:28:42.380 --&gt; 00:28:54.709
Ray Lutz: And you can also then just apply the data types at the end if you want. Like, you can read it in, apply the data types, apply the transform. You don't need to flatten it anymore, and I'll show you why, most of the time.</p>
<p>237
00:28:54.980 --&gt; 00:28:57.570
Ray Lutz: And then you just say to Csv and write it out.</p>
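<p>The row-by-row pipeline, and the mutation hazard he just warned about, can be sketched in plain Python (the <code>transform</code> function, <code>src</code>/<code>dst</code> names, and sample data are this sketch's):</p>

```python
import copy

# Read rows, transform each, append to a new table. If the transform
# mutated the row it was handed, the "new" row would be the same object
# as the old one -- so copy first to leave the source table untouched.
hd = {"a": 0, "b": 1}
src = [[1, 2], [3, 4]]

def transform(row: list) -> list:
    new_row = copy.copy(row)   # shallow copy: don't mutate the source
    new_row[hd["b"]] *= 10
    return new_row

dst = [transform(row) for row in src]
print(dst)  # [[1, 20], [3, 40]]
print(src)  # [[1, 2], [3, 4]] -- unchanged, because we copied
```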
<p>238
00:28:57.960 --&gt; 00:29:01.169
Ray Lutz: So here's where you read it in, you transform row by row,</p>
<p>239
00:29:02.940 --&gt; 00:29:14.319
Ray Lutz: and if you're doing this, daffodil works really well. Okay. And this same sort of transform can be applied to, I'll show you in a second, when we're expanding this to use SQL</p>
<p>240
00:29:15.330 --&gt; 00:29:16.210
Ray Lutz: backing</p>
<p>241
00:29:18.610 --&gt; 00:29:24.210
Ray Lutz: So we avoid copies. This is what makes it faster, way faster than pandas, most of the time.</p>
<p>242
00:29:24.390 --&gt; 00:29:26.660
Ray Lutz: Pandas is fast</p>
<p>243
00:29:26.990 --&gt; 00:29:33.479
Ray Lutz: if you're doing those matrix manipulations, or rather the array manipulations that are used in numpy.</p>
<p>244
00:29:33.810 --&gt; 00:29:40.909
Ray Lutz: But if you do stupid things like add columns and add rows and append things and stuff, it gets really, really slow,</p>
<p>245
00:29:41.040 --&gt; 00:29:48.260
Ray Lutz: and also when you end up copying. So we're using references to existing data rather than recopying, unless you want to.</p>
<p>246
00:29:49.530 --&gt; 00:29:53.550
Ray Lutz: So row selection reuses the existing header dictionary</p>
<p>247
00:29:53.870 --&gt; 00:30:00.470
Ray Lutz: and the selected list values from the source daffodil array, and then</p>
<p>248
00:30:00.730 --&gt; 00:30:10.269
Ray Lutz: processing by columns is slower, but you can usually avoid that. What you want to do is, in one fell swoop, if you want to add columns and drop them,</p>
<p>249
00:30:11.090 --&gt; 00:30:12.809
Ray Lutz: You do that all at one time.</p>
<p>250
00:30:13.310 --&gt; 00:30:16.820
Ray Lutz: and, in fact, if you want to do that,</p>
<p>251
00:30:19.150 --&gt; 00:30:20.690
Ray Lutz: if you want to flip the</p>
<p>252
00:30:20.960 --&gt; 00:30:25.246
Ray Lutz: flip the array on a diagonal, which is...</p>
<p>253
00:30:26.710 --&gt; 00:30:29.919
Ray Lutz: why can't I think of it? It starts with "trans". I can't think of it.</p>
<p>254
00:30:30.895 --&gt; 00:30:33.259
Ray Lutz: We'll get to that in a second. But the</p>
<p>255
00:30:34.496 --&gt; 00:30:43.109
Ray Lutz: daffodil is pretty slow, head to head, when you're doing manipulations of</p>
<p>256
00:30:43.460 --&gt; 00:30:44.700
Ray Lutz: numerics.</p>
<p>257
00:30:45.190 --&gt; 00:30:51.100
Ray Lutz: But when you're doing this type of row selection, it's much faster,</p>
<p>258
00:30:51.440 --&gt; 00:31:03.609
Ray Lutz: and column-based things. Oh, it's transposition! If you say flip is true, and you're adding rows or subtracting them, you can also flip it for free, because you have to make a whole new one anyway,</p>
<p>259
00:31:04.720 --&gt; 00:31:12.470
Ray Lutz: so you can flip it for free, if you want to, when you're changing the columns, dropping them and adding them. But you want to do that all at one time:</p>
<p>260
00:31:13.170 --&gt; 00:31:14.359
Ray Lutz: Add and drop.</p>
<p>261
00:31:14.460 --&gt; 00:31:21.460
Ray Lutz: basically modify the columns, end up with a new array that has the columns that you need, and then mutate it in place.</p>
<p>262
00:31:23.110 --&gt; 00:31:25.880
Ray Lutz: In other words, don't add columns one at a time,</p>
<p>263
00:31:26.470 --&gt; 00:31:29.340
Ray Lutz: because it's just going to be a lot of overhead.</p>
<p>264
00:31:31.280 --&gt; 00:31:37.400
Ray Lutz: But if you have columns in there, then just don't use the ones that you don't want to use. Okay, next thing.</p>
<p>265
00:31:37.790 --&gt; 00:31:44.809
Ray Lutz: So the keyed list is one of the core technologies that we developed inside this, once we got some more experience.</p>
<p>266
00:31:45.500 --&gt; 00:31:51.710
Ray Lutz: So a keyed list is basically, if you</p>
<p>267
00:31:51.870 --&gt; 00:31:58.130
Ray Lutz: take... if you do a zip of keys and values, this creates a conventional dictionary. So you have,</p>
<p>268
00:31:58.300 --&gt; 00:32:04.090
Ray Lutz: let's say you have a list, and you have keys you want to apply to the list of values.</p>
<p>269
00:32:04.250 --&gt; 00:32:07.959
Ray Lutz: You have to go through this transformation, and this takes time.</p>
<p>270
00:32:08.570 --&gt; 00:32:12.000
Ray Lutz: It distributes the values to each item in the dictionary.</p>
<p>271
00:32:12.340 --&gt; 00:32:16.860
Ray Lutz: You create a dictionary with these keys, and then you put a value on each one.</p>
<p>272
00:32:17.100 --&gt; 00:32:25.010
Ray Lutz: and it's in memory. Now, all of a sudden, the values are distributed out in this dictionary. You don't know what order they're in, some weird order now.</p>
<p>273
00:32:25.760 --&gt; 00:32:32.019
Ray Lutz: The dictionary takes care of presenting them in the same order, but in the actual dictionary itself, I don't know what order they're in.</p>
<p>274
00:32:32.610 --&gt; 00:32:34.240
Ray Lutz: They're not a list anymore.</p>
<p>275
00:32:34.360 --&gt; 00:32:35.690
Ray Lutz: Let's put it that way</p>
<p>276
00:32:36.480 --&gt; 00:32:42.069
Ray Lutz: so you can get the list out by saying dict.values(); you can get the list out,</p>
<p>277
00:32:42.560 --&gt; 00:32:48.910
Ray Lutz: and you can get the keys out. It's not a list at this point; you'd have to convert it to a list. It's a keys...</p>
<p>278
00:32:49.290 --&gt; 00:32:51.120
Ray Lutz: It's a keys type. Oops.</p>
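<p>The conventional-dictionary route he's describing is just <code>dict(zip(...))</code>, and the views he mentions are indeed not lists:</p>

```python
# Zip keys onto values to build a conventional dictionary.
keys = ["a", "b", "c"]
values = [34, 45, 56]

d = dict(zip(keys, values))
print(d)                 # {'a': 34, 'b': 45, 'c': 56}

# Getting the list back out requires a conversion: .values() and .keys()
# return view objects, not lists.
print(list(d.values()))          # [34, 45, 56]
print(type(d.keys()).__name__)   # dict_keys
```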
<p>279
00:32:51.740 --&gt; 00:32:56.009
Ray Lutz: So we propose this concept of a keyed list,</p>
<p>280
00:32:56.680 --&gt; 00:33:01.210
Ray Lutz: which contains a header dictionary that contains indexes for each key, and</p>
<p>281
00:33:02.950 --&gt; 00:33:09.609
Ray Lutz: this is one way to create the header dictionary, and it's an easy way to understand it, but it's not the most optimal way to do it.</p>
<p>282
00:33:09.760 --&gt; 00:33:11.419
Ray Lutz: So you're going to have a column</p>
<p>283
00:33:11.540 --&gt; 00:33:23.539
Ray Lutz: name and the index, from the enumeration of the keys. So this index is going to go 0, 1, 2, 3, 4, 5, and that's going to be the value, right? You go through all the keys, and you put them up here.</p>
<p>284
00:33:23.640 --&gt; 00:33:28.090
Ray Lutz: So this stays the same for every</p>
<p>285
00:33:28.400 --&gt; 00:33:36.420
Ray Lutz: keyed list of the same size and with the same columns. You don't need a new header dictionary. You can use the same one for different keyed lists.</p>
<p>286
00:33:36.850 --&gt; 00:33:41.490
Ray Lutz: And the keyed list is</p>
<p>287
00:33:41.870 --&gt; 00:33:50.640
Ray Lutz: a list. So, unlike a regular dictionary, which distributes the values amongst all the keys in the structure,</p>
<p>288
00:33:50.950 --&gt; 00:33:53.400
Ray Lutz: the list here is still a list.</p>
<p>289
00:33:54.930 --&gt; 00:34:02.499
Ray Lutz: It still looks like a dictionary: you have a key and a value, but it's structured, and you can still get the list out.</p>
<p>290
00:34:04.370 --&gt; 00:34:09.670
Ray Lutz: It looks like a dictionary, but it's not designed like a dictionary.</p>
<p>291
00:34:10.050 --&gt; 00:34:15.480
Ray Lutz: It has a header which has, excuse me</p>
<p>292
00:34:16.360 --&gt; 00:34:19.470
Ray Lutz: like a: 0, b: 1, c: 2,</p>
<p>293
00:34:20.190 --&gt; 00:34:28.119
Ray Lutz: and then a list associated with that, and this way it's faster to do things. So if you have a keyed list,</p>
<p>294
00:34:28.610 --&gt; 00:34:32.309
Ray Lutz: like a is 34, b is 45, and c is 56,</p>
<p>295
00:34:33.159 --&gt; 00:34:36.180
Ray Lutz: and you have values here. 1, 2, 3.</p>
<p>296
00:34:37.460 --&gt; 00:34:42.620
Ray Lutz: you can say the keyed list's .values equals this values list, to assign new values.</p>
<p>297
00:34:43.440 --&gt; 00:34:46.719
Ray Lutz: And now you have a new keyed list with those values in there.</p>
<p>298
00:34:47.530 --&gt; 00:34:52.000
Ray Lutz: you could do the same thing with the dictionary. It would put new values into the dictionary.</p>
<p>299
00:34:53.510 --&gt; 00:34:57.329
Ray Lutz: If you say, what is the value of B.</p>
<p>300
00:34:58.010 --&gt; 00:35:04.480
Ray Lutz: You know, you're saying, I want to assign 67 to the B.</p>
<p>301
00:35:04.930 --&gt; 00:35:06.970
Ray Lutz: Now you have 67 here</p>
<p>302
00:35:07.450 --&gt; 00:35:11.410
Ray Lutz: The values list that you originally used... sorry, I can't click.</p>
<p>303
00:35:11.540 --&gt; 00:35:17.100
Ray Lutz: the values list that you originally used also got changed because it's the same list.</p>
<p>304
00:35:17.460 --&gt; 00:35:18.729
Ray Lutz: When you said.</p>
<p>305
00:35:19.160 --&gt; 00:35:31.599
Ray Lutz: I want to assign this values list to the keyed list's .values, it did not make a new list. It did not recopy anything. All it did was add a reference in here to this existing list,</p>
<p>306
00:35:33.840 --&gt; 00:35:41.649
Ray Lutz: and then "the values list is the keyed list's .values" outputs True;</p>
<p>307
00:35:41.780 --&gt; 00:35:45.139
Ray Lutz: the 'is' operator means it is exactly the same thing.</p>
<p>308
00:35:45.900 --&gt; 00:35:49.550
Ray Lutz: It's the same thing in memory. There's no new version of it.</p>
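<p>A minimal sketch of the keyed-list mechanism: a shared header dict plus a plain values list held by reference. This <code>KeyedList</code> class is this sketch's invention, not daffodil's actual class, but it shows the by-reference behavior he just demonstrated:</p>

```python
class KeyedList:
    """Header dict -> positions, plus a values list held by reference."""

    def __init__(self, hd: dict[str, int], values: list):
        self.hd = hd          # shared; reusable across many keyed lists
        self.values = values  # stored by reference, never copied

    def __getitem__(self, key: str):
        return self.values[self.hd[key]]

    def __setitem__(self, key: str, value):
        self.values[self.hd[key]] = value

hd = {"a": 0, "b": 1, "c": 2}
vals = [34, 45, 56]
kl = KeyedList(hd, vals)

kl["b"] = 67
print(kl["b"])            # 67
print(vals)               # [34, 67, 56] -- the original list changed too
print(kl.values is vals)  # True: same object in memory, no new copy
```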
<p>309
00:35:50.370 --&gt; 00:35:55.670
Ray Lutz: So a keyed list means that we can</p>
<p>310
00:35:57.003 --&gt; 00:36:03.320
Ray Lutz: number one, if you have a list and you want to put it into your daffodil array,</p>
<p>311
00:36:04.780 --&gt; 00:36:11.099
Ray Lutz: you don't turn it into a dictionary; if you have a list, it goes directly into that list in the array.</p>
<p>312
00:36:11.430 --&gt; 00:36:15.050
Ray Lutz: Now, if you want to iterate through the daffodil array.</p>
<p>313
00:36:15.650 --&gt; 00:36:20.339
Ray Lutz: it's convenient to iterate through with keyed lists, because, if you modify one.</p>
<p>314
00:36:20.450 --&gt; 00:36:25.580
Ray Lutz: it actually modifies the array without having to recopy it in just the way a</p>
<p>315
00:36:25.730 --&gt; 00:36:28.430
Ray Lutz: like. If you had a list of dictionaries</p>
<p>316
00:36:28.710 --&gt; 00:36:35.689
Ray Lutz: and you go through the list of dictionaries, and you have a dictionary in hand, and you change that item.</p>
<p>317
00:36:36.690 --&gt; 00:36:44.360
Ray Lutz: It actually is the same dictionary as in the main array, the list of dictionaries, and it'll change it in the list of dictionaries.</p>
<p>318
00:36:44.560 --&gt; 00:36:50.450
Ray Lutz: Now, if you have a daffodil array and you pull a dictionary out.</p>
<p>319
00:36:50.830 --&gt; 00:36:56.150
Ray Lutz: It's not the same data as what's in the array, and if you change it, it doesn't change what's in the array.</p>
<p>320
00:36:56.510 --&gt; 00:36:58.570
Ray Lutz: But if you take a keyed list out.</p>
<p>321
00:36:58.900 --&gt; 00:37:07.050
Ray Lutz: and you change that item in that list. That list is the same one that's in the array, and you've changed it without having to recopy it back in. So then, it works the same way as dictionaries do.</p>
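<p>The same aliasing can be seen with ordinary built-in types, which is all this sketch uses:</p>

```python
# Iterating a list of dicts hands you references: mutating the item
# in hand mutates the row inside the outer list as well.
table = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
for row in table:
    row["a"] *= 10            # changes the dicts inside `table`
print(table[0]["a"])          # 10

# A list pulled out of a list of lists behaves the same way.
rows = [[1, 2], [3, 4]]
first = rows[0]               # a reference, not a copy
first[0] = 99
print(rows)                   # [[99, 2], [3, 4]]
```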
<p>322
00:37:07.830 --&gt; 00:37:11.330
Ray Lutz: I don't know if I I probably can add another slide for that to explain it.</p>
<p>323
00:37:12.240 --&gt; 00:37:14.490
Ray Lutz: Now we're going to go to a new topic.</p>
<p>324
00:37:14.940 --&gt; 00:37:18.390
Ray Lutz: Csv, reading very, very fast. If you have</p>
<p>325
00:37:21.110 --&gt; 00:37:25.429
Ray Lutz: as you stay in string type, so as long as you don't convert anything.</p>
<p>326
00:37:26.140 --&gt; 00:37:30.528
Ray Lutz: the python reader is really fast</p>
<p>327
00:37:31.580 --&gt; 00:37:39.080
Ray Lutz: for a million rows. According to this guy here. This reference he timed it. I'm not sure I trust this, but anyway, I used it because it was a reference.</p>
<p>328
00:37:39.360 --&gt; 00:37:44.210
Ray Lutz: and if you do a Pandas read Csv. It's much more time.</p>
<p>329
00:37:46.960 --&gt; 00:37:51.909
Ray Lutz: Pandas, read Csv with a chunk size is for some reason worse.</p>
<p>330
00:37:52.080 --&gt; 00:37:55.340
Ray Lutz: A Dask data frame was worse.</p>
<p>331
00:37:55.630 --&gt; 00:38:04.049
Ray Lutz: a data table, I guess, is another option. It's not as fast. This looks like absurdly</p>
<p>332
00:38:04.280 --&gt; 00:38:15.620
Ray Lutz: way better than it really is, so I'll have to look into that. But it's still very, very fast, because it doesn't do any type conversion for you, and pandas does this automatically to try to</p>
<p>333
00:38:15.780 --&gt; 00:38:17.700
Ray Lutz: be easy to use.</p>
<p>334
00:38:17.830 --&gt; 00:38:20.650
Ray Lutz: But if you don't want that, it doesn't happen,</p>
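<p>Staying in strings with the standard library reader looks like this (a small in-memory buffer stands in for a real file):</p>

```python
import csv
import io

# Build a small CSV in memory (a stand-in for a file on disk).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["a", "b"])
for i in range(1000):
    writer.writerow([i, i * 2])
buf.seek(0)

# csv.reader does no type conversion: every cell comes back as str,
# which is a big part of why it is fast.
rows = list(csv.reader(buf))
header, data = rows[0], rows[1:]
print(data[0])   # ['0', '0'] -- strings, not ints
```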
<p>335
00:38:21.290 --&gt; 00:38:26.700
Ray Lutz: so later you can apply the D types and unflatten.</p>
<p>336
00:38:27.230 --&gt; 00:38:36.669
Ray Lutz: unflatten them, which would bring them back into a python data type. Such as, if you have a dictionary in a cell,</p>
<p>337
00:38:37.170 --&gt; 00:38:42.340
Ray Lutz: and it gets turned into either Json or what I call pyon,</p>
<p>338
00:38:42.490 --&gt; 00:38:44.270
Ray Lutz: which we're going to get into in a second.</p>
<p>339
00:38:44.710 --&gt; 00:38:50.050
Ray Lutz: then it will reform that into the dictionary within the cell.</p>
<p>340
00:38:52.220 --&gt; 00:38:54.230
Ray Lutz: Now, Csv. Writer.</p>
<p>341
00:38:54.690 --&gt; 00:39:02.520
Ray Lutz: write flattens automatically to pyon. I didn't know this. Pyon is something that I dreamed up as a name.</p>
<p>342
00:39:03.140 --&gt; 00:39:07.729
Ray Lutz: It means python object notation. And it's similar to Json.</p>
<p>343
00:39:08.630 --&gt; 00:39:14.329
Ray Lutz: It's actually a superset of Json. Javascript Object Notation</p>
<p>344
00:39:14.520 --&gt; 00:39:18.730
Ray Lutz: is Json. And this is simply python object notation.</p>
<p>345
00:39:18.920 --&gt; 00:39:25.750
Ray Lutz: but it can express sets, tuples, dicts, lists, functions, etc. So we can do everything</p>
<p>346
00:39:26.633 --&gt; 00:39:40.640
Ray Lutz: within python, and it already does, for the most part, except for functions. It'll create sets, tuples, dicts, lists automatically in the Csv writer without you doing anything.</p>
<p>347
00:39:41.710 --&gt; 00:39:43.569
Ray Lutz: I just stumbled across this.</p>
<p>348
00:39:44.205 --&gt; 00:39:50.850
Ray Lutz: Now pyon already exists. It's already defined. It's what you get if you repr something.</p>
<p>349
00:39:53.110 --&gt; 00:39:54.560
Ray Lutz: Generally speaking.</p>
<p>350
00:39:54.840 --&gt; 00:40:02.690
Ray Lutz: not always, because sometimes the reprs are broken in these things. But what they should do is define this as being</p>
<p>351
00:40:03.190 --&gt; 00:40:08.180
Ray Lutz: a Csv writer should use repr, and sometimes it uses the str function instead.</p>
<p>352
00:40:11.540 --&gt; 00:40:21.019
Ray Lutz: it's better than Pickle, Json, Pickle, and other variants of Json for working with python types, in my opinion.</p>
<p>353
00:40:21.940 --&gt; 00:40:25.030
Ray Lutz: So I generated this pyon tools</p>
<p>354
00:40:25.210 --&gt; 00:40:30.269
Ray Lutz: python module. It isn't quite published yet, but I'm using it myself.</p>
<p>355
00:40:30.630 --&gt; 00:40:37.289
Ray Lutz: and it turns out the Csv output is very simple, because what you're doing is using the repr method for any object,</p>
<p>356
00:40:37.440 --&gt; 00:40:40.980
Ray Lutz: and it basically does it already.</p>
<p>357
00:40:41.090 --&gt; 00:40:48.960
Ray Lutz: But when you're using Csv writer, you don't have to change this. So the way I stumbled across this is, I had dictionaries</p>
<p>358
00:40:49.170 --&gt; 00:40:54.449
Ray Lutz: in my daffodil array, and I wrote it out to a file,</p>
<p>359
00:40:54.600 --&gt; 00:41:01.139
Ray Lutz: and it automatically converted them and flattened them out into character strings, the normal ones that you see</p>
<p>360
00:41:02.866 --&gt; 00:41:13.269
Ray Lutz: when you look at a dictionary, like the ones we were just looking at; they look just like this dictionary right here:</p>
<p>361
00:41:13.420 --&gt; 00:41:21.259
Ray Lutz: open brace, single quote, a, single quote, colon, 0, comma. All that sort of thing.</p>
<p>362
00:41:21.610 --&gt; 00:41:31.939
Ray Lutz: This is the expression, a string expression that represents a dictionary. It isn't the dictionary itself. Dictionary itself is some other, you know, thing in memory, and</p>
<p>363
00:41:32.100 --&gt; 00:41:42.130
Ray Lutz: of a fairly complex structure that python has suppressed. And what you understand as a dictionary, are these symbols right here?</p>
<p>364
00:41:42.430 --&gt; 00:41:52.550
Ray Lutz: Those symbols are character strings that can be represented in a file. So this is what you get. If you have a dictionary, which is this header dictionary with those things in it. That's exactly what you find in the</p>
<p>365
00:41:52.940 --&gt; 00:41:54.910
Ray Lutz: Csv file,</p>
<p>366
00:41:55.290 --&gt; 00:42:04.269
Ray Lutz: this right here. Unfortunately, it doesn't use double quotes, so it's not exactly Json. If they allowed you to use double quotes instead, this would be Json,</p>
<p>367
00:42:04.920 --&gt; 00:42:11.979
Ray Lutz: and then you could use it with other tools, a little bit of a ripple there with what they use in python.</p>
<p>368
00:42:12.100 --&gt; 00:42:14.969
Ray Lutz: and maybe we can get Csv. Writer to</p>
<p>369
00:42:15.140 --&gt; 00:42:19.120
Ray Lutz: optionally use double quotes, so it would still be valid pyon</p>
<p>370
00:42:19.750 --&gt; 00:42:23.499
Ray Lutz: but also meets the subset of Json.</p>
<p>371
00:42:24.730 --&gt; 00:42:27.319
Ray Lutz: I think the Python community should embrace</p>
<p>372
00:42:27.550 --&gt; 00:42:30.940
Ray Lutz: the pyon that they've already defined, but they don't have a name for it</p>
<p>373
00:42:31.270 --&gt; 00:42:38.240
Ray Lutz: and provide options for Csv. Writer to use double quotes and stuff in there, because then it would produce Json.</p>
<p>374
00:42:38.570 --&gt; 00:42:43.040
Ray Lutz: But this is how we flatten things from Daffodil. We do almost nothing.</p>
<p>375
00:42:43.160 --&gt; 00:42:45.529
Ray Lutz: Python already does it for us.</p>
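<p>A minimal demonstration with the standard library: csv.writer stringifies a dict cell into its python literal form (single quotes, so not quite Json), and literal_eval restores it later:</p>

```python
import ast
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["row1", {"a": 0, "b": 1}])
line = buf.getvalue().strip()
print(line)                    # row1,"{'a': 0, 'b': 1}"

# Reading it back: the cell is a string until you unflatten it.
cell = next(csv.reader(io.StringIO(line)))[1]
print(ast.literal_eval(cell))  # {'a': 0, 'b': 1}
```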
<p>376
00:42:46.340 --&gt; 00:42:53.050
Ray Lutz: Now, when we import Csv. We do it in a very controlled, explicit manner. So it comes in as strings.</p>
<p>377
00:42:54.110 --&gt; 00:43:08.590
Ray Lutz: and then we convert them. Unfortunately, pandas is optimized for tables with just numerics and a simple header, normal for Csv. It's hard to work around this. You can do it, but it's just a pain in the ass to try to get it to do weird things.</p>
<p>378
00:43:10.230 --&gt; 00:43:13.730
Ray Lutz: So what we do is we have this</p>
<p>379
00:43:14.000 --&gt; 00:43:21.190
Ray Lutz: daffodil dtypes, which is something you can specify, and it doesn't do anything by itself. It just gets carried around in the frame.</p>
<p>380
00:43:21.510 --&gt; 00:43:23.669
Ray Lutz: But if you</p>
<p>381
00:43:23.910 --&gt; 00:43:33.459
Ray Lutz: are importing things you can say, apply it, and then it will apply dtypes to the columns that you want to apply to. If the columns don't exist, it's not going to hurt</p>
<p>382
00:43:33.620 --&gt; 00:43:35.420
Ray Lutz: you, if you dropped them.</p>
<p>383
00:43:37.130 --&gt; 00:43:41.140
Ray Lutz: You don't want to apply d types to any columns you're not going to actually use.</p>
<p>384
00:43:41.410 --&gt; 00:43:50.979
Ray Lutz: So if you bring in an array, and it's got 5,000 columns, and you only need 3 of them: first, drop everything else that you don't need, or just work on the ones that you want to work on.</p>
<p>385
00:43:51.120 --&gt; 00:44:01.939
Ray Lutz: In fact, you don't need to drop them if you brought it all the way in. Just work on the columns that you want to work with, and then just ignore the rest. As soon as you start converting the thing, then you're starting to add time.</p>
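<p>The idea of applying dtypes only to the columns you will actually use can be sketched in plain python (the column names and dtype map here are invented for illustration):</p>

```python
# Everything arrives from CSV as strings; convert only what you need.
header = ["id", "name", "score", "notes"]
rows = [["1", "x", "2.5", "ok"], ["2", "y", "3.5", "meh"]]

dtypes = {"id": int, "score": float}   # leave other columns untouched

idx = {name: i for i, name in enumerate(header)}
for row in rows:
    for col, cast in dtypes.items():
        row[idx[col]] = cast(row[idx[col]])

print(rows[0])   # [1, 'x', 2.5, 'ok']
```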
<p>386
00:44:04.110 --&gt; 00:44:06.869
Ray Lutz: So we have a few other features. I want to mention.</p>
<p>387
00:44:07.450 --&gt; 00:44:10.200
Ray Lutz: number one. We have an indirect functionality</p>
<p>388
00:44:10.490 --&gt; 00:44:15.199
Ray Lutz: where a dict can specify the contents of specific columns. So inside of a cell</p>
<p>389
00:44:15.930 --&gt; 00:44:23.250
Ray Lutz: you have a dictionary, and that dictionary actually specifies column names and values</p>
<p>390
00:44:23.510 --&gt; 00:44:40.629
Ray Lutz: which are to be interpreted as part of the actual array. But it's got an indirection. So you first go into the cell, you find out what's specified there, and then that's to be interpreted as the rest of the array. This is useful for sparse arrays. As I was saying, what I was working with</p>
<p>391
00:44:40.760 --&gt; 00:44:44.310
Ray Lutz: was a very sparse array with like 5,600 columns,</p>
<p>392
00:44:44.740 --&gt; 00:44:48.410
Ray Lutz: and only about 50 of them are used at any one time.</p>
<p>393
00:44:48.960 --&gt; 00:44:51.849
Ray Lutz: So if you represent this as</p>
<p>394
00:44:51.990 --&gt; 00:44:55.661
Ray Lutz: an actual Csv file or anything like that,</p>
<p>395
00:44:57.050 --&gt; 00:45:02.380
Ray Lutz: It's very, very costly, because you have all these commas right, comma comma comma comma.</p>
<p>396
00:45:02.530 --&gt; 00:45:12.309
Ray Lutz: and to represent all of the 5,600 columns when you're only going to use 50, and then you have them in there, and you got to try to figure out which ones they are. It's a mess. So</p>
<p>397
00:45:12.850 --&gt; 00:45:23.699
Ray Lutz: in this case, even though you want the array to be logically 5,600 columns for any one row. You don't want to have to specify more than just the 50 columns that you're working with.</p>
<p>398
00:45:24.380 --&gt; 00:45:37.019
Ray Lutz: And so in that one cell, what you have is a dictionary which specifies all of the columns that you're working with, and then it is logically considered part of the array. So if you sum a column down the rows,</p>
<p>399
00:45:37.340 --&gt; 00:45:41.520
Ray Lutz: it figures out where those are, takes that indirection into account,</p>
<p>400
00:45:41.670 --&gt; 00:45:45.160
Ray Lutz: expands them and works with summing them that way.</p>
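<p>The indirection can be sketched with plain dicts (this shows the idea, not the daffodil API): each sparse row names only the columns it uses, and summing expands on the fly:</p>

```python
# Logically 5,600 columns, but each row stores only the ~50 it uses.
rows = [
    {"c3": 2, "c17": 5},
    {"c3": 1, "c9": 4},
]

def column_sum(rows, col):
    # Columns missing from a row count as 0 when it is "expanded".
    return sum(row.get(col, 0) for row in rows)

print(column_sum(rows, "c3"))    # 3
print(column_sum(rows, "c42"))   # 0
```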
<p>401
00:45:46.600 --&gt; 00:45:57.230
Ray Lutz: We have a from Pdf method that will take a Pdf file with, you know, a header and columns, and parse it,</p>
<p>402
00:45:57.590 --&gt; 00:46:06.050
Ray Lutz: usually skipping a few things. You can use a few controls there to skip things. But to just convert from basic Pdf files that you might find</p>
<p>403
00:46:06.692 --&gt; 00:46:11.399
Ray Lutz: a little shortcut, you can do it yourself, but this shortcuts some work.</p>
<p>404
00:46:11.700 --&gt; 00:46:19.234
Ray Lutz: It offers the attrs attributes dictionary as part of the</p>
<p>405
00:46:21.550 --&gt; 00:46:26.399
Ray Lutz: class; an instance has this, and this is the same as in Pandas.</p>
<p>406
00:46:26.620 --&gt; 00:46:35.670
Ray Lutz: You can add any kind of attribute you want to a data frame, and</p>
<p>407
00:46:35.950 --&gt; 00:46:44.030
Ray Lutz: what I find convenient is, like, if I've already figured out that these are the metadata columns, they're all strings, and the rest of it is data.</p>
<p>408
00:46:45.090 --&gt; 00:46:54.610
Ray Lutz: Once I parse that and I know where they are, I need to pass that along and say, these are the metadata columns. This is the number of columns that's the metadata</p>
<p>409
00:46:55.215 --&gt; 00:47:00.590
Ray Lutz: and that's easy to do. You just put that into this attrs, and then</p>
<p>410
00:47:00.810 --&gt; 00:47:14.399
Ray Lutz: your next function says, well, how many metadata columns are there? Oh, it's in attrs; I already know that. Now, you do have to know that it's in there, but at least you don't have to pass another variable along, or, even worse, recalculate it.</p>
<p>411
00:47:14.590 --&gt; 00:47:21.159
Ray Lutz: So that's something that, it turns out, pandas had, and we're just using it the same way.</p>
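<p>In pandas this looks like the following; attrs is a real, if lightly documented, feature, and the key name here is made up for the example:</p>

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "v1": [1, 2], "v2": [3, 4]})

# attrs is a free-form metadata dict carried on the frame, so a later
# function can look facts up instead of recomputing them.
df.attrs["meta_cols"] = ["name"]

def data_columns(frame):
    meta = frame.attrs.get("meta_cols", [])
    return [c for c in frame.columns if c not in meta]

print(data_columns(df))   # ['v1', 'v2']
```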
<p>412
00:47:21.590 --&gt; 00:47:27.490
Ray Lutz: And then we're now offering a join method that is efficient and mimics a join in SQL.</p>
<p>413
00:47:29.430 --&gt; 00:47:35.409
Ray Lutz: pandas doesn't have a real join, it has a merge. So when you do a merge in pandas you.</p>
<p>414
00:47:36.910 --&gt; 00:47:44.299
Ray Lutz: You essentially are doing what this does here. But when we're</p>
<p>415
00:47:44.440 --&gt; 00:47:46.030
Ray Lutz: and I'll get to this in a second.</p>
<p>416
00:47:46.590 --&gt; 00:47:55.179
Ray Lutz: we're extending daffodil to use SQL in the background. And when the SQL engine does a join,</p>
<p>417
00:47:55.290 --&gt; 00:47:59.290
Ray Lutz: it doesn't actually do anything; it just keeps track of the join.</p>
<p>418
00:47:59.720 --&gt; 00:48:19.770
Ray Lutz: And if you say I want to take these columns in this table, and I want to join them with these columns in this table along this key, it doesn't do anything. It just remembers that. And if you say I also want to join this table in this table, in this table and this on these keys. You could do them one at a time, and you can do up to, I think, 64 times, or maybe it's 16. But there's a certain limit.</p>
<p>419
00:48:19.910 --&gt; 00:48:22.240
Ray Lutz: and then, once you have all your joins done,</p>
<p>420
00:48:22.720 --&gt; 00:48:31.590
Ray Lutz: and you say, I want to select these columns, these rows out of my joined tables, then it does it. Then it figures it out and it pulls all the data in and does it.</p>
<p>421
00:48:31.980 --&gt; 00:48:32.810
Ray Lutz: Okay.</p>
<p>422
00:48:32.960 --&gt; 00:48:51.309
Ray Lutz: so it's nice that way: SQL, when it does a join, creates a view. It doesn't actually do anything. Now, pandas always does something. It always does a merge. It takes data from one thing, puts it with this one, and merges it together. Essentially, that's what this join does. But</p>
<p>423
00:48:51.470 --&gt; 00:48:57.639
Ray Lutz: this one is, we'll be using the SQL type of join when we get to that.</p>
<p>424
00:48:58.010 --&gt; 00:49:00.029
Ray Lutz: So there's the plans for SQL.</p>
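<p>The lazy-join behavior described above can be seen with the standard library's sqlite3: a view records the join, and nothing is computed until you select from it (table names here are invented):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT);
    CREATE TABLE scores (id INTEGER, score INTEGER);
    INSERT INTO people VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO scores VALUES (1, 10), (2, 20);
    -- The view just remembers the join; no data moves yet.
    CREATE VIEW joined AS
        SELECT p.name, s.score FROM people p JOIN scores s ON p.id = s.id;
""")

# Only this SELECT actually performs the join.
print(con.execute("SELECT * FROM joined WHERE score > 15").fetchall())
# [('bob', 20)]
```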
<p>425
00:49:01.270 --&gt; 00:49:10.500
Ray Lutz: Now, right now, daffodil arrays must fit into memory. So if it doesn't fit into memory, you're going to have to chunk it,</p>
<p>426
00:49:13.440 --&gt; 00:49:29.729
Ray Lutz: which we do. So we have a thing where we chunk things, and there's a lot of infrastructure I might add to daffodil that chunks things. So basically, we have like a hundred things in one chunk, and we have thousands of those;</p>
<p>427
00:49:29.980 --&gt; 00:49:37.310
Ray Lutz: we don't necessarily want to combine them upfront. We combine them all in one fell swoop and then make one big file,</p>
<p>428
00:49:37.440 --&gt; 00:49:39.339
Ray Lutz: or just work with the chunks.</p>
<p>429
00:49:40.470 --&gt; 00:49:50.440
Ray Lutz: What we'll do here with SQL: since the row-based daffodil arrays are similar to SQL data tables, which are also row-based,</p>
<p>430
00:49:51.120 --&gt; 00:49:53.339
Ray Lutz: But they have column operations. Of course.</p>
<p>431
00:49:53.860 --&gt; 00:50:14.559
Ray Lutz: we'll add kwargs, additional keyword arguments, in the indexing to specify whether it will be an SQL table. Or another way to say it: you just take the original daffodil, .to_sql, and we'll give it a name, and then we'll get this daffodil table, main_sql_daf; we'll just call it that.</p>
<p>432
00:50:14.760 --&gt; 00:50:18.009
Ray Lutz: You don't have to use this name, and that will be</p>
<p>433
00:50:18.800 --&gt; 00:50:21.019
Ray Lutz: how we refer to it within python.</p>
<p>434
00:50:21.360 --&gt; 00:50:25.990
Ray Lutz: And this actually will look like a daffodil table. But it's actually in SQL,</p>
<p>435
00:50:26.190 --&gt; 00:50:31.890
Ray Lutz: so we don't actually have the table in daffodil. It's basically a proxy to the actual table.</p>
<p>436
00:50:32.610 --&gt; 00:50:39.430
Ray Lutz: Then operations on the SQL daf will operate as if the table were in memory, but it actually is operating in the SQL engine,</p>
<p>437
00:50:39.890 --&gt; 00:50:40.880
Ray Lutz: and</p>
<p>438
00:50:41.570 --&gt; 00:50:55.309
Ray Lutz: the result is, we can allow much larger tables while still manipulating in the daffodil array paradigm, with selection and indexing done in a pythonic way. So essentially, we're still going to use those square brackets, you know. The first one is the row.</p>
<p>439
00:50:55.420 --&gt; 00:51:03.800
Ray Lutz: Well, that's like a select, you know. The second one is the column; so SELECT *, that would be kind of like the second thing, the columns that you want,</p>
<p>440
00:51:03.980 --&gt; 00:51:11.139
Ray Lutz: and then you say which things you want to select; in an SQL statement that would be the first</p>
<p>441
00:51:11.370 --&gt; 00:51:15.381
Ray Lutz: parameter, selecting with AND and the like. So</p>
<p>442
00:51:16.730 --&gt; 00:51:27.859
Ray Lutz: I won't go into some of the difficulties that we found in sqlite. But sqlite does not have a row-based, like a vector-based, operation, so that we could have</p>
<p>443
00:51:31.860 --&gt; 00:51:34.319
Ray Lutz: python,</p>
<p>444
00:51:34.530 --&gt; 00:51:47.540
Ray Lutz: an apply that would take, say, a row from the table, run it through python, return an entire row, and then add it to a new table. Sqlite doesn't provide that; that would be an extension we'd want to see.</p>
<p>445
00:51:47.998 --&gt; 00:51:55.519
Ray Lutz: all they allow is Scalar returns, and then it's a lot of work to do it, so it's better to bring a chunk out of the table</p>
<p>446
00:51:55.850 --&gt; 00:52:01.129
Ray Lutz: as a daffodil array. Apply it within python, and then move it back into</p>
<p>447
00:52:01.500 --&gt; 00:52:05.320
Ray Lutz: the SQL. Right? That's the best way to do it right now. It's the fastest.</p>
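<p>That chunk-out, apply-in-python, move-back pattern can be sketched with sqlite3 (the table and column names are invented):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(10)])

# 1) Pull a chunk out as plain Python rows.
chunk = con.execute("SELECT id, v FROM t WHERE id < 5").fetchall()

# 2) Apply a whole-row transformation in Python.
updated = [(v * 100, i) for (i, v) in chunk]

# 3) Move the results back in one batch.
con.executemany("UPDATE t SET v = ? WHERE id = ?", updated)
print(con.execute("SELECT v FROM t WHERE id = 3").fetchone()[0])   # 300
```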
<p>448
00:52:07.336 --&gt; 00:52:10.609
Ray Lutz: So once, if you want to use</p>
<p>449
00:52:11.560 --&gt; 00:52:14.560
Ray Lutz: we would also support general SQL. Queries</p>
<p>450
00:52:14.810 --&gt; 00:52:20.530
Ray Lutz: and and the proxy. So if you say I want to do, I want to actually use this SQL, query.</p>
<p>451
00:52:21.110 --&gt; 00:52:30.490
Ray Lutz: Then I keep doing that click can't click when I'm in this. So if you want to have an SQL. Query and apply it to this, this proxy.</p>
<p>452
00:52:31.860 --&gt; 00:52:34.610
Ray Lutz: We don't know the name of the table over there, necessarily.</p>
<p>453
00:52:34.800 --&gt; 00:52:38.880
Ray Lutz: And one thing about python is you.</p>
<p>454
00:52:39.010 --&gt; 00:52:47.389
Ray Lutz: When you create a daffodil table. You don't know what name you're going to apply to it, because it's not the way Python works. It doesn't know what name it has. In fact, it could have many names.</p>
<p>455
00:52:49.910 --&gt; 00:52:53.390
Ray Lutz: In SQL. When you have a table it has a name.</p>
<p>456
00:52:53.530 --&gt; 00:52:56.559
Ray Lutz: and you have to use that name to refer to it all the time.</p>
<p>457
00:52:58.870 --&gt; 00:53:05.299
Ray Lutz: so we don't. We're gonna be have to name our table with some arbitrary name, if you don't give it one.</p>
<p>458
00:53:05.670 --&gt; 00:53:06.899
Ray Lutz: And then</p>
<p>459
00:53:07.660 --&gt; 00:53:13.269
Ray Lutz: or we may actually always name it with arbitrary name and then map it over. But essentially.</p>
<p>460
00:53:14.780 --&gt; 00:53:18.910
Ray Lutz: when you do an SQL statement.</p>
<p>461
00:53:19.450 --&gt; 00:53:22.219
Ray Lutz: and you say, I want to</p>
<p>462
00:53:22.340 --&gt; 00:53:25.069
Ray Lutz: like, select blah blah blah from</p>
<p>463
00:53:25.770 --&gt; 00:53:27.700
Ray Lutz: you have to put in a table name.</p>
<p>464
00:53:27.960 --&gt; 00:53:39.890
Ray Lutz: Okay? And you're not going to know what that name is. And so that's why we're going to have to have some substitution going on with that. And that won't be too hard for people to do if they want to use general purpose. Queries</p>
<p>465
00:53:42.040 --&gt; 00:53:47.979
Ray Lutz: pretty much. Everything within pandas is repeat, is available within daffodil</p>
<p>466
00:53:49.160 --&gt; 00:53:55.049
Ray Lutz: But some things that are not available in pandas are available like we can do append which has been deprecated.</p>
<p>467
00:53:58.270 --&gt; 00:54:19.940
Ray Lutz: but most things run the same way. A little bit different, because pandas is normally columns of numpy arrays, and so, if you don't, and so, if you, the 1st value in the square brackets is a column by default, whereas in our mode the 1st thing by default is the row just to be aware of that</p>
<p>468
00:54:23.060 --&gt; 00:54:28.340
Ray Lutz: and so pretty much they're all there. And again the timing we went over briefly at the beginning.</p>
<p>469
00:54:30.730 --&gt; 00:54:38.663
Ray Lutz: Daffodil is faster for array, manipulation like appending rows. But pandas is faster. If you're going to do column based</p>
<p>470
00:54:39.590 --&gt; 00:54:41.020
Ray Lutz: manipulations.</p>
<p>471
00:54:41.300 --&gt; 00:54:45.950
Ray Lutz: Basically. Here was my summary about use cases, and when you want to use each one.</p>
<p>472
00:54:46.160 --&gt; 00:54:50.950
Ray Lutz: if you have existing data in well-defined, column-based format, then.</p>
<p>473
00:54:51.820 --&gt; 00:54:54.970
Ray Lutz: and almost all data is numeric.</p>
<p>474
00:54:55.760 --&gt; 00:55:01.139
Ray Lutz: and you don't want to do appending or ponification other than maybe creating some additional columns.</p>
<p>475
00:55:02.210 --&gt; 00:55:09.709
Ray Lutz: and then maybe produce plots after you analyze it, and and so forth. Then pandas might be the best choice for sure.</p>
<p>476
00:55:11.190 --&gt; 00:55:13.289
Ray Lutz: as long as the data fits in memory.</p>
<p>477
00:55:13.980 --&gt; 00:55:24.469
Ray Lutz: once it gets out of memory, then maybe you're going to use Daffodil, SQL. Might be a good choice. We'll have to see that hasn't really been haven't really tested that enough to know if that's going to be a better choice for you.</p>
<p>478
00:55:25.040 --&gt; 00:55:30.090
Ray Lutz: If you're building data tables by analyzing converting images or other data penning to a table</p>
<p>479
00:55:30.250 --&gt; 00:55:44.499
Ray Lutz: that is not going to be pandas. If you want to have small utility tables used for tracking processes or parsing data, driving state machines, all these kind of little tables you might use all the time throughout your code.</p>
<p>480
00:55:44.970 --&gt; 00:55:51.520
Ray Lutz: Use daffodil tables. Don't get involved with pandas. That's for data analysis. And those specific things.</p>
<p>481
00:55:53.091 --&gt; 00:55:55.550
Ray Lutz: Once you build the table.</p>
<p>482
00:55:55.670 --&gt; 00:55:58.259
Ray Lutz: Then you might want to use pandas or numpy.</p>
<p>483
00:55:58.860 --&gt; 00:56:07.090
Ray Lutz: We can convert individual columns, and this is pretty good way to do it. Convert it to numpy arrays so that the columns can be managed.</p>
<p>484
00:56:07.886 --&gt; 00:56:13.330
Ray Lutz: There can be, you know, some, for example, like, if you sum 2 columns, you get another whole column.</p>
<p>485
00:56:13.760 --&gt; 00:56:33.910
Ray Lutz: This kind of operation here will work. If it's a numpy array, or you can multiply columns, you can do all of the functions, add them together multiply by a scalar. You can all of these functions here, or say, divide one column by another one. It creates another whole column, and and that expression is very fast.</p>
<p>486
00:56:34.480 --&gt; 00:56:42.450
Ray Lutz: So you can do this by just having a dictionary of numpy arrays converted on the columns that you want to use.</p>
<p>487
00:56:43.210 --&gt; 00:56:55.069
Ray Lutz: If you want to do state updates like, like tracking the state of a user in a web-based application or something. Then you're going to want to use an SQL or no SQL. Or something kind of database, and not use any of these. Of course.</p>
<p>488
00:56:55.837 --&gt; 00:57:03.530
Ray Lutz: Statuses that you know I've used it quite a bit myself. It hasn't really been adopted very much. That's okay. I mean, we're still</p>
<p>489
00:57:03.690 --&gt; 00:57:06.800
Ray Lutz: sort of researching the best ways for this to work.</p>
<p>490
00:57:07.398 --&gt; 00:57:12.489
Ray Lutz: I've used it myself. I've convert almost everything over from Pandas, and and I love it.</p>
<p>491
00:57:15.100 --&gt; 00:57:22.539
Ray Lutz: and a couple of cases I still have to use. SQL. Because the tables got big. And so that's why I want to convert over to Daffodil. SQL,</p>
<p>492
00:57:23.301 --&gt; 00:57:25.869
Ray Lutz: and that's pretty much what I had there.</p>
<p>493
00:57:26.330 --&gt; 00:57:30.270
Ray Lutz: Okay, so I'm done. Guess</p>
<p>494
00:57:30.410 --&gt; 00:57:35.550
Ray Lutz: my contact is Ray <a href="mailto:lutz@cognizys.com">lutz@cognizys.com</a>, or you can. That's my email.</p>
<p>495
00:57:37.210 --&gt; 00:57:42.549
Ray Lutz: Any questions. I guess I I used up the whole hour a little bit more and not</p>
<p>496
00:57:43.430 --&gt; 00:57:49.000
Ray Lutz: gave room for many questions. I see a chat room is okay. I'll leave.</p>
<p>497
00:57:49.000 --&gt; 00:57:51.159
Gabor Szabo: I think apologies, no worries.</p>
<p>498
00:57:51.160 --&gt; 00:57:51.800
Gabor Szabo: Don't do that.</p>
<p>499
00:57:52.590 --&gt; 00:57:55.050
Gabor Szabo: If anyone has questions, then then please do ask.</p>
<p>500
00:57:55.200 --&gt; 00:57:59.460
Gabor Szabo: I just wanted to say something. 1st of all. Thank you very much for the presentation.</p>
<p>501
00:57:59.750 --&gt; 00:58:04.290
Gabor Szabo: But one thing that is sort of related.</p>
<p>502
00:58:04.970 --&gt; 00:58:17.130
Gabor Szabo: that I see many people using pandas for, for I just read in Csv file and do some simple manipulation, and they always go to Pandas, because that's what they learned.</p>
<p>503
00:58:17.470 --&gt; 00:58:21.289
Gabor Szabo: and they don't use the Standard Csv Library in.</p>
<p>504
00:58:21.805 --&gt; 00:58:22.320
Ray Lutz: Yeah.</p>
<p>505
00:58:22.600 --&gt; 00:58:33.609
Gabor Szabo: And and I had a feeling that pandas is just way too big for this. But now your your numbers show that it's way slower than than using the Standard Csv Library.</p>
<p>506
00:58:34.620 --&gt; 00:58:38.569
Ray Lutz: Way slower and also just getting in and out of it.</p>
<p>507
00:58:39.090 --&gt; 00:58:43.039
Ray Lutz: like, if you're if you're just staying within the Pandas world.</p>
<p>508
00:58:43.280 --&gt; 00:58:46.079
Ray Lutz: and you're doing stuff that are pandas related.</p>
<p>509
00:58:47.080 --&gt; 00:58:48.270
Ray Lutz: It's great.</p>
<p>510
00:58:48.440 --&gt; 00:58:52.260
Ray Lutz: And and I think, though, that what we're going to find</p>
<p>511
00:58:52.400 --&gt; 00:59:02.439
Ray Lutz: is that using a daffodil array and converting it to a dictionary of numpy arrays, which is kind of what's inside pandas. But pandas has grown so big</p>
<p>512
00:59:02.780 --&gt; 00:59:10.259
Ray Lutz: that I've watched it load. It takes several seconds, maybe like 5, 10 seconds for it just to be imported.</p>
<p>513
00:59:10.600 --&gt; 00:59:15.820
Ray Lutz: So when you're running one of these interpreters and you're using a huge library like Pandas.</p>
<p>514
00:59:16.550 --&gt; 00:59:19.070
Ray Lutz: I mean the main Panda's class.</p>
<p>515
00:59:19.310 --&gt; 00:59:23.379
Ray Lutz: Just one class is like 13,000 lines.</p>
<p>516
00:59:23.690 --&gt; 00:59:28.290
Ray Lutz: It's a real. It's all in one file, I mean, I'm really surprised. They still write them this way.</p>
<p>517
00:59:28.400 --&gt; 00:59:35.630
Ray Lutz: but it's it's a very highly functional thing. And here's the thing is that people are nowadays.</p>
<p>518
00:59:36.320 --&gt; 00:59:45.059
Ray Lutz: They might be using an AI machine to assist them, and they say, You know, read this in, and and you know, do a few conversions and then put out, help me do this plot.</p>
<p>519
00:59:45.770 --&gt; 00:59:52.860
Ray Lutz: The AI machines. They know perfectly well how to use pandas, and they'll do it now, for that</p>
<p>520
00:59:53.260 --&gt; 00:59:59.179
Ray Lutz: efficiency is not important, really. You may wait a few extra seconds. But who cares?</p>
<p>521
01:00:00.353 --&gt; 01:00:09.340
Ray Lutz: So? And Daffodil is is a little bit different animal, that actually is all python.</p>
<p>522
01:00:10.120 --&gt; 01:00:15.499
Ray Lutz: and not that pandas is not, you know pandas is numpy, and</p>
<p>523
01:00:15.750 --&gt; 01:00:30.809
Ray Lutz: but it's restrictive in what it can put in into its list, and it's designed around numerics. So when you start adding strings or anything else. It just freaks out. It's just like you're just going to have. Then you got to go back to a dictionary of lists, a list of dictionaries, I mean.</p>
<p>524
01:00:31.100 --&gt; 01:00:36.260
Ray Lutz: and that's what I ended up doing. But then I didn't have the functionality of selecting rows and other things that you want.</p>
<p>525
01:00:38.170 --&gt; 01:00:40.676
Ray Lutz: Now will this catch on? I don't know.</p>
<p>526
01:00:42.030 --&gt; 01:00:46.110
Ray Lutz: I think that that it for most people</p>
<p>527
01:00:46.290 --&gt; 01:00:49.019
Ray Lutz: I mean, I still like to use pandas for</p>
<p>528
01:00:49.170 --&gt; 01:00:54.550
Ray Lutz: for certain things, just because I know that the AI machine knows exactly what to do with it.</p>
<p>529
01:00:55.293 --&gt; 01:01:04.750
Ray Lutz: Once the AI machine understands Daffodil, it might do it, but I don't think it's really the use case. Daffodil is more for programmers than for</p>
<p>530
01:01:05.280 --&gt; 01:01:07.370
Ray Lutz: people who are data analysts.</p>
<p>531
01:01:08.320 --&gt; 01:01:11.200
Ray Lutz: It's more for somebody who wants to program in</p>
<p>532
01:01:12.350 --&gt; 01:01:18.409
Ray Lutz: that uses these. I mean, I use them all the time, because if you don't use one,</p>
<p>533
01:01:18.920 --&gt; 01:01:27.129
Ray Lutz: and you know that it's not suitable to use pandas for this. So then you want to refer to a column of the array.</p>
<p>534
01:01:27.910 --&gt; 01:01:36.059
Ray Lutz: Well, it's a list of dictionaries, so you don't have columns. You have to go through and write in a comprehension that pulls out the column. You can do that.</p>
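<p>Ray's point about pulling a column out of a list of dictionaries can be sketched in a few lines of plain Python; the field names here are made up for illustration:</p>

```python
# A tiny "table" kept as a list of dictionaries -- the structure Ray describes.
# The field names "item" and "price" are hypothetical.
rows = [
    {"item": "apple", "price": 3},
    {"item": "banana", "price": 2},
    {"item": "cherry", "price": 7},
]

# There is no built-in notion of a column, so you write a comprehension
# that pulls one value out of every row:
prices = [row["price"] for row in rows]
print(prices)  # [3, 2, 7]
```

<p>A dataframe library like Daffodil or pandas wraps exactly this kind of access in a well-tested method instead of an ad-hoc comprehension.</p>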
<p>535
01:01:36.600 --&gt; 01:01:41.220
Ray Lutz: but it's easier just to have something that's all well tested, and everything that pulls that column out</p>
<p>536
01:01:42.930 --&gt; 01:01:47.280
Ray Lutz: makes conversions and so forth. So it's a handy thing to have.</p>
<p>537
01:01:47.590 --&gt; 01:01:49.899
Ray Lutz: and I think it's logical to have</p>
<p>538
01:01:50.070 --&gt; 01:01:52.420
Ray Lutz: a next step up. So we have</p>
<p>539
01:01:52.550 --&gt; 01:01:58.670
Ray Lutz: fairly high level data structures in python, such as lists, dictionaries</p>
<p>540
01:01:59.010 --&gt; 01:02:02.419
Ray Lutz: very highly functional and really nice.</p>
<p>541
01:02:02.920 --&gt; 01:02:09.959
Ray Lutz: But we need to move up a level and have a two-dimensional functional data frame within the python world</p>
<p>542
01:02:10.220 --&gt; 01:02:11.510
Ray Lutz: and not</p>
<p>543
01:02:12.090 --&gt; 01:02:19.120
Ray Lutz: make it numpy. Not that I'm against numpy. It's just that it's very restrictive as to what you can put into those cells.</p>
<p>544
01:02:20.880 --&gt; 01:02:25.430
Ray Lutz: you can put some strings in. I think it's up to 20 characters or something. So.</p>
<p>545
01:02:26.550 --&gt; 01:02:31.590
Ray Lutz: okay, all right. Well, thank you so much. I guess</p>
<p>546
01:02:32.030 --&gt; 01:02:35.199
Ray Lutz: you're right. It's most people,</p>
<p>547
01:02:35.570 --&gt; 01:02:43.179
Ray Lutz: I think are going to say, well, I'm just going to continue to use python pandas, because that's what I'm used to, and</p>
<p>548
01:02:43.680 --&gt; 01:02:45.039
Ray Lutz: I don't really care about</p>
<p>549
01:02:45.180 --&gt; 01:02:56.389
Ray Lutz: time so much as what you're saying here, and I'm not doing appending. But if you're doing the appending, if you're building these tables up, that's when Daffodil becomes a pretty handy little tool.</p>
<p>550
01:02:57.490 --&gt; 01:02:59.970
Ray Lutz: Okay, thanks a lot, Gabor. I guess that's the end.</p>
<p>551
01:03:00.360 --&gt; 01:03:09.079
Gabor Szabo: Yeah. So thank you. Thank you again for giving this presentation, and thank you to everyone who was present and listened to the presentation.</p>
<p>552
01:03:09.330 --&gt; 01:03:17.269
Gabor Szabo: If you like the video, then please like it, and follow the channel and see you next time.</p>
<p>553
01:03:17.810 --&gt; 01:03:19.889
Ray Lutz: Okay, thanks a lot. Okay, bye.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Reducing your memory footprint by 75% with 6 lines with Tomer Brisker</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-02-26T08:30:01Z</updated>
    <pubDate>2025-02-26T08:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/reducing-your-memory-footprint" />
    <id>https://python.code-maven.com/reducing-your-memory-footprint</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/VYSxuicxulE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>While profiling a slow process I stumbled upon a surprising way to reduce our memory consumption. This talk will present some useful profiling tools, and an important thing to know when using AbstractBaseClass extensively.
In this session, we will dive into the realm of Python optimization, as we cover some essential profiling tools designed to identify and resolve performance bottlenecks in your code. We'll navigate through practical examples, showcasing how these tools can provide invaluable insights into your application's memory and CPU usage patterns.
Furthermore, we'll delve into some nuances of AbstractBaseClass usage, and its implications on speed and memory management in Python applications.
Whether you're a seasoned developer or just starting your journey with Python, this session offers some practical strategies to optimize Python programs effectively.</p>
<p><img src="images/tomer-brisker.jpeg" alt="Tomer Brisker" /></p>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.250 --&gt; 00:00:29.170
Gabor Szabo: Hi, and welcome to the Code Maven events, meetings, and the Code Maven events channel if you're watching it on YouTube. My name is Gabor Szabo. I usually teach Python and Rust and help companies with these 2 languages mostly. And I also organize these events because I really like the idea of sharing knowledge, I mean receiving knowledge from other people, like this time from Tomer,</p>
<p>2
00:00:29.330 --&gt; 00:00:49.120
Gabor Szabo: and from around the world. So that's a good idea, I think. And that's it. If you're watching on YouTube, then please like the video and follow the channel, and thanks everyone who arrived to this meeting, and especially Tomer for giving us the presentation. Now it's your turn.</p>
<p>3
00:00:49.510 --&gt; 00:00:52.949
Gabor Szabo: So welcome to introduce yourself. And yeah.</p>
<p>4
00:00:53.120 --&gt; 00:00:56.100
Tomer Brisker: Thank you. Let me share my screen.</p>
<p>5
00:00:58.800 --&gt; 00:01:00.990
Tomer Brisker: Okay, can you see it?</p>
<p>6
00:01:01.920 --&gt; 00:01:02.670
Gabor Szabo: Yes.</p>
<p>7
00:01:03.060 --&gt; 00:01:03.920
Tomer Brisker: Excellent.</p>
<p>8
00:01:05.334 --&gt; 00:01:09.469
Tomer Brisker: Okay, so 1st of all, I have to make a confession.</p>
<p>9
00:01:09.770 --&gt; 00:01:14.040
Tomer Brisker: It wasn't 6 lines of code. It was actually 7 lines of code.</p>
<p>10
00:01:14.460 --&gt; 00:01:28.110
Tomer Brisker: And I guess you're all pretty curious what these lines of code were. So here they are. Okay. Okay. It was 8 lines of code, if you count the space in between the functions.</p>
<p>11
00:01:28.660 --&gt; 00:01:39.080
Tomer Brisker: and we'll dive into what exactly these lines of code mean a bit later, and why these allowed us to save so much memory.</p>
<p>12
00:01:39.220 --&gt; 00:01:41.780
Tomer Brisker: But 1st of all, just so, you believe me.</p>
<p>13
00:01:41.930 --&gt; 00:01:47.609
Tomer Brisker: this is for a memory usage graph in production. When we deployed this fix.</p>
<p>14
00:01:47.770 --&gt; 00:02:01.359
Tomer Brisker: As you can see, the deployment was around 5:10 PM, which is a great time to deploy fixes. If I remember correctly, this was a Thursday, which is the end of the week in Israel, a perfect time for deploying to production.</p>
<p>15
00:02:01.870 --&gt; 00:02:13.220
Tomer Brisker: But 1st of all, do we have anyone on the call who happens to be a US citizen or has to file US tax reports?</p>
<p>16
00:02:18.200 --&gt; 00:02:21.170
Tomer Brisker: feel free to wave, or something.</p>
<p>17
00:02:21.860 --&gt; 00:02:24.939
Tomer Brisker: If there are, I guess I guess not.</p>
<p>18
00:02:25.090 --&gt; 00:02:31.839
Tomer Brisker: Well, if you were, I guess this would probably look pretty familiar to you.</p>
<p>19
00:02:31.950 --&gt; 00:02:43.169
Tomer Brisker: So for those of you who don't know, practically all US citizens are required to file tax reports annually with the IRS</p>
<p>20
00:02:43.440 --&gt; 00:02:47.260
Tomer Brisker: for the income taxes. This is a</p>
<p>21
00:02:47.360 --&gt; 00:02:52.590
Tomer Brisker: pretty painful process. It requires filling out a lot of obscure forms.</p>
<p>22
00:02:53.093 --&gt; 00:03:17.609
Tomer Brisker: And if you make some mistakes on it, you can find yourself in jail. So most people either pay one of the existing companies who provide services for filing tax reports, or pay an accountant to do the tax reports for them. So, hi! My name is Tomer. I'm the tech lead at a company that is fixing the issue of IRS tax report filing.</p>
<p>23
00:03:18.833 --&gt; 00:03:27.250
Tomer Brisker: We are developing a simple to use application that allows users to file their taxes seamlessly with the IRS.</p>
<p>24
00:03:27.793 --&gt; 00:03:48.870
Tomer Brisker: It usually takes most users around half an hour to do their taxes, which is pretty awesome compared to what it normally takes, which is many hours. And we don't charge them nearly as much as an accountant or one of the existing providers charges for this.</p>
<p>25
00:03:49.694 --&gt; 00:04:01.029
Tomer Brisker: If I look a bit tired in the recording, I'm going to let you guess which one of these is to blame today. It's her, she's on the left.</p>
<p>26
00:04:01.935 --&gt; 00:04:04.964
Tomer Brisker: She's 6 months old.</p>
<p>27
00:04:05.810 --&gt; 00:04:11.380
Tomer Brisker: I live with my partner and 2 kids in Givatayim, which is a suburb of Tel Aviv,</p>
<p>28
00:04:12.070 --&gt; 00:04:18.920
Tomer Brisker: and our dog. But I guess you're not here to hear about me and my life.</p>
<p>29
00:04:19.149 --&gt; 00:04:22.340
Tomer Brisker: You're here to hear about performance in Python.</p>
<p>30
00:04:23.170 --&gt; 00:04:25.279
Tomer Brisker: So let's dive in.</p>
<p>31
00:04:25.420 --&gt; 00:04:30.670
Tomer Brisker: Our story begins when we noticed there's some</p>
<p>32
00:04:30.830 --&gt; 00:04:35.610
Tomer Brisker: certain action in our system that's taking quite a long time to complete</p>
<p>33
00:04:36.321 --&gt; 00:04:41.589
Tomer Brisker: in fact, we even got some reports from users, complaining that they were hitting timeouts</p>
<p>34
00:04:41.740 --&gt; 00:04:45.890
Tomer Brisker: when they were running this specific action within the system.</p>
<p>35
00:04:46.710 --&gt; 00:05:01.249
Tomer Brisker: and I was assigned to this task, started digging in, and I managed to reproduce it locally. I created a nice little script that created the exact conditions of the users that were timing out,</p>
<p>36
00:05:01.800 --&gt; 00:05:09.590
Tomer Brisker: and the 1st step was to see how long it actually takes. And yeah, it was actually pretty slow.</p>
<p>37
00:05:09.690 --&gt; 00:05:20.060
Tomer Brisker: Python comes with a couple of built-in modules in the standard library that are pretty nice when you're timing things: time and timeit.</p>
<p>38
00:05:20.599 --&gt; 00:05:25.270
Tomer Brisker: I will leave reading their exact documentation to the listener.</p>
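<p>As a minimal sketch of the two standard-library modules mentioned here (the workloads are made up for illustration):</p>

```python
import time
import timeit

# time.perf_counter() gives a high-resolution clock for manually timing a region
start = time.perf_counter()
total = sum(range(1_000_000))
elapsed = time.perf_counter() - start
print(f"sum of 0..999999 = {total}, took {elapsed:.4f}s")

# timeit runs a snippet many times and returns the total duration,
# which smooths out one-off measurement noise
duration = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs of sum(range(1000)) took {duration:.4f}s")
```

<p>Both are most useful when you already know which piece of code you want to measure, which is exactly the limitation the talk goes on to discuss.</p>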
<p>39
00:05:26.530 --&gt; 00:05:27.429
Tomer Brisker: But</p>
<p>40
00:05:28.480 --&gt; 00:05:54.030
Tomer Brisker: It's very useful if you know what you're looking for. For example, if there's a specific method that you know is slow, and you want to measure some change that you make to it and see the impact of it, this is very useful. We even have a little wrapper method that allows us to easily measure the timings for various functions that we call.</p>
<p>41
00:05:54.405 --&gt; 00:06:02.299
Tomer Brisker: But what do we do if we're not sure where the slowness in this specific action in the system is coming from?</p>
<p>42
00:06:02.470 --&gt; 00:06:16.779
Tomer Brisker: This was a pretty complex action. It involved calling several different services and a lot of methods, so it's pretty difficult if you don't know where the slowness is coming from.</p>
<p>43
00:06:18.680 --&gt; 00:06:23.669
Tomer Brisker: So this is what profilers were invented for</p>
<p>44
00:06:24.368 --&gt; 00:06:31.150
Tomer Brisker: profiler. There's a TV series called the Profiler. We're not going to talk about it. I have no idea what it's about.</p>
<p>45
00:06:31.430 --&gt; 00:06:35.990
Tomer Brisker: but profilers generally come in different varieties.</p>
<p>46
00:06:36.560 --&gt; 00:07:03.049
Tomer Brisker: There are a few varieties, which we will mention as we go. The 1st variety is deterministic profilers. These are profilers that essentially register every method call you make: they write down the start time, and once that method returns they write down the end time. Python's standard library has a nice one called cProfile.</p>
<p>47
00:07:03.330 --&gt; 00:07:24.280
Tomer Brisker: It gives you a context manager. Basically, you wrap the code that you want to measure with the context manager, then call whatever slow function you want to profile and save the statistics to a file. And cProfile will actually take care of going over all of the method calls within that function,</p>
<p>48
00:07:24.410 --&gt; 00:07:30.459
Tomer Brisker: measuring how long they take, how many times each method is called, etc.</p>
<p>49
00:07:30.560 --&gt; 00:07:41.699
Tomer Brisker: And it saves all of these statistics into a file, which can be read using a module called pstats.</p>
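<p>The workflow described here can be sketched as follows; slow_function is a hypothetical stand-in for the slow action being investigated:</p>

```python
import cProfile
import pstats

def slow_function():
    # hypothetical stand-in for the slow action being investigated
    return sum(i * i for i in range(100_000))

# cProfile.Profile works as a context manager (Python 3.8+):
with cProfile.Profile() as profiler:
    result = slow_function()

# Save the collected statistics to a file...
profiler.dump_stats("profile.out")

# ...and read them back with pstats, sorted by cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)
```

<p>The printed table is the same ncalls / tottime / cumtime layout discussed next.</p>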
<p>50
00:07:41.810 --&gt; 00:08:05.169
Tomer Brisker: pstats allows you to read these files. There's a lot of information here, so we'll dive into it a little bit to understand what this table is about. First of all, on the right, we can see the file name, line number, and function name. Pretty basic, so you know what we're calling here.</p>
<p>51
00:08:05.260 --&gt; 00:08:18.769
Tomer Brisker: On the left hand side, we can see the number of calls to each function. As you can see, the 1st 3 here were called just once. This is actually the script that I was using to debug this issue.</p>
<p>52
00:08:19.561 --&gt; 00:08:29.669
Tomer Brisker: The second column. Here is total time, which is the time that was spent within this specific method in total all of the times that it was called.</p>
<p>53
00:08:29.840 --&gt; 00:08:39.579
Tomer Brisker: and there's cumulative time, which is basically the time that was spent within this method and any other method that was called from within that method.</p>
<p>54
00:08:40.289 --&gt; 00:08:40.799
Tomer Brisker: and</p>
<p>55
00:08:41.130 --&gt; 00:08:53.159
Tomer Brisker: something stood out pretty quickly to me: out of 12 seconds runtime in total, about 6 seconds, or half the runtime, was spent in one specific method.</p>
<p>56
00:08:53.955 --&gt; 00:08:59.914
Tomer Brisker: And this method is ABC subclass check. Interesting. Okay?</p>
<p>57
00:09:00.620 --&gt; 00:09:13.350
Tomer Brisker: Let's see. And even more interesting is the number of times this was called: in 12 seconds, we actually called this method 175,000 times.</p>
<p>58
00:09:13.410 --&gt; 00:09:40.590
Tomer Brisker: That's the number on the right. And if there's a slash and another number here, that means this method was calling itself recursively. So in this case we were calling ABC subclass check a bit over 3 million times. So for every single call that we were making to subclass check, it was actually making about 20 recursive calls on average.</p>
<p>59
00:09:41.633 --&gt; 00:09:53.739
Tomer Brisker: Okay, pretty interesting. But still, I'm not quite sure why we're calling this method so many times, or why is it taking so long when this method is being called.</p>
<p>60
00:09:53.920 --&gt; 00:10:04.660
Tomer Brisker: And that's what the second type of profilers is really useful for identifying. The second type of profilers is statistical profilers.</p>
<p>61
00:10:04.720 --&gt; 00:10:21.380
Tomer Brisker: These are profilers that basically take a snapshot of your Python call stack (or in any other language, the call stack) at a certain interval. Usually this is done every few milliseconds. The shorter the interval,</p>
<p>62
00:10:21.380 --&gt; 00:10:37.109
Tomer Brisker: obviously the higher the impact it has on performance. On the other hand, if you set too long of an interval, you might miss very quick method calls that return within the interval, and they won't actually be registered in the profile.</p>
<p>63
00:10:37.190 --&gt; 00:10:44.800
Tomer Brisker: and a very common way of looking at statistical profiles is using a tool called flame charts.</p>
<p>64
00:10:45.400 --&gt; 00:11:06.120
Tomer Brisker: The way that flame charts work: basically, you have 2 axes here. The x-axis is time, so the bigger a block is on the x-axis, the longer the time that was spent within that specific block; and the y-axis is the stack.</p>
<p>65
00:11:06.420 --&gt; 00:11:26.929
Tomer Brisker: So you can see the actual call stack of every single method, and you can see why it was being called and where the call was coming from; see the bigger ones, the smaller ones. That's very helpful when you need to debug and identify why a certain method is being called a lot of times.</p>
<p>66
00:11:27.900 --&gt; 00:11:31.660
Tomer Brisker: So I used one of these statistical profilers</p>
<p>67
00:11:31.990 --&gt; 00:11:54.509
Tomer Brisker: specifically, one called py-spy. There are multiple different profilers available for Python, and each language has its own ecosystem of profilers. I'm just showing the ones that I used in this case, but there are various other tools that are useful, and they're all good in their own fields.</p>
<p>68
00:11:55.047 --&gt; 00:12:02.359
Tomer Brisker: So I ran a statistical profile with py-spy on this reproducer that I created.</p>
<p>69
00:12:02.490 --&gt; 00:12:10.869
Tomer Brisker: And hmm, yeah, okay, this is fine. This is fine. I can deal with that.</p>
<p>70
00:12:11.030 --&gt; 00:12:20.069
Tomer Brisker: as you can see, a flame chart when you have a very complex operation, can be very, very, very difficult to read.</p>
<p>71
00:12:20.650 --&gt; 00:12:27.250
Tomer Brisker: Sometimes there's something that stands out, you see, a very big block that's taking a very long time to call.</p>
<p>72
00:12:27.380 --&gt; 00:12:52.389
Tomer Brisker: and you can identify the bottleneck pretty quickly from looking at this. But other cases everything is on fire, and you don't really know what's going on. Specifically, in this case we were seeing a certain method call being called 3 million times, which makes sense that it would be very difficult to identify all of these different calls within the flame chart.</p>
<p>73
00:12:52.640 --&gt; 00:12:57.119
Tomer Brisker: and for that there was a nice tool called Sandwich.</p>
<p>74
00:12:57.330 --&gt; 00:13:02.459
Tomer Brisker: Not that kind of sandwich. There's a tool called speedscope, and it has a</p>
<p>75
00:13:02.670 --&gt; 00:13:05.350
Tomer Brisker: way of showing flame charts,</p>
<p>76
00:13:05.460 --&gt; 00:13:18.199
Tomer Brisker: flame charts, in a different way, which they call sandwich. Basically, on the left hand side, we can see all of the different method calls within our application, within the run that was profiled.</p>
<p>77
00:13:18.560 --&gt; 00:13:27.659
Tomer Brisker: And we can sort this list by the total time and by the self time. These are the same, by the way, as we saw previously: total time, and</p>
<p>78
00:13:27.790 --&gt; 00:13:33.126
Tomer Brisker: the cumulative time in cProfile.</p>
<p>79
00:13:33.860 --&gt; 00:13:36.710
Tomer Brisker: And then once you click on one of these</p>
<p>80
00:13:37.000 --&gt; 00:13:42.490
Tomer Brisker: you can see on the right hand side there are 2 parts: the callers and the callees.</p>
<p>81
00:13:42.500 --&gt; 00:14:09.719
Tomer Brisker: The top half shows you where this method was being called from. So, for example, in this case we can see subclass check was mostly called from instance check and some other internal methods. Also here we can see instance check was being called from various other methods. And here the x-axis actually shows the time that's cumulative for the specific</p>
<p>82
00:14:09.810 --&gt; 00:14:29.109
Tomer Brisker: method. So this isn't a single call; this is the total of the times that it was called from here. And obviously I can't show the internals of our system, but you can see that there were several places where we were calling instance check pretty commonly, leading to most of</p>
<p>83
00:14:29.170 --&gt; 00:14:41.379
Tomer Brisker: the load on this method. And on the left, you can see again, this was about 8 seconds in this test run. So quite a long time</p>
<p>84
00:14:41.480 --&gt; 00:14:44.690
Tomer Brisker: from the overall. Time.</p>
<p>85
00:14:45.310 --&gt; 00:14:50.109
Tomer Brisker: Okay, so instance, check subclass check.</p>
<p>86
00:14:50.290 --&gt; 00:15:05.529
Tomer Brisker: This is like built-in Python stuff, right? And it's not something in our code base. What should I do about it? It's pretty odd, I don't know. Let's, I guess, ask Dr. Google.</p>
<p>87
00:15:06.302 --&gt; 00:15:16.930
Tomer Brisker: and turns out there's an open issue about ABC subclass check, which has a very poor performance. And I think a memory leak. Hmm!</p>
<p>88
00:15:17.610 --&gt; 00:15:18.680
Tomer Brisker: Memory leak.</p>
<p>89
00:15:19.030 --&gt; 00:15:20.260
Tomer Brisker: Interesting.</p>
<p>90
00:15:21.537 --&gt; 00:15:25.922
Tomer Brisker: Memories. Memory is pretty expensive.</p>
<p>91
00:15:26.940 --&gt; 00:15:35.520
Tomer Brisker: and it turns out I'm pretty bad at counting. So there's not actually 2 kinds of profilers; there's 3 kinds of profilers.</p>
<p>92
00:15:35.710 --&gt; 00:15:41.029
Tomer Brisker: There's also memory profilers besides the runtime profilers.</p>
<p>93
00:15:41.170 --&gt; 00:16:10.060
Tomer Brisker: Memory, obviously, is expensive if you need to use and allocate a lot of it, but it's also expensive in terms of performance, because if the Python runtime runs out of memory, it has to make system calls to allocate additional memory to the Python program. The garbage collector also has to go over all of the memory and clean up unused memory. So the more memory you allocate, the slower the garbage collection will be.</p>
<p>94
00:16:10.120 --&gt; 00:16:38.349
Tomer Brisker: So these tools, the memory profilers, allow us to identify issues with our memory allocations. Sometimes our program can be very fast but allocate a very large amount of memory. Just recently we actually had a case where a certain process was crashing, and we were seeing pods being killed, so out-of-memory killed.</p>
<p>95
00:16:38.380 --&gt; 00:16:45.739
Tomer Brisker: basically in Kubernetes, when you allocate a certain amount of memory to a process.</p>
<p>96
00:16:45.930 --&gt; 00:17:15.490
Tomer Brisker: if the process runs over that memory, the Kubernetes controller will kill it so it doesn't starve out other processes. And in this specific case, memory was running out so quickly that it wasn't even sending telemetry data to Prometheus. And we actually used a memory profiler to identify where exactly this memory was being allocated so rapidly that it was killing our pods.</p>
<p>97
00:17:16.339 --&gt; 00:17:34.970
Tomer Brisker: So let's talk about memory profilers a bit. There's a really nice one for Python called Memray. It gives you some nice runtime statistics on your program. In this case, this is the reproducer script that I was using.</p>
<p>98
00:17:34.970 --&gt; 00:17:51.320
Tomer Brisker: We can see that we actually had about 11 million object allocations during the script. You can notice that the runtime here is a bit longer than the 12 seconds that it took when running with just cProfile. And that's because</p>
<p>99
00:17:51.330 --&gt; 00:17:57.990
Tomer Brisker: every single memory allocation that the program does. There's some overhead to it when you're profiling it.</p>
<p>100
00:17:58.010 --&gt; 00:18:04.920
Tomer Brisker: So the Runtime was a bit slower here, and we were allocating nearly 2 GB of memory.</p>
<p>101
00:18:07.340 --&gt; 00:18:10.049
Tomer Brisker: when running this process.</p>
<p>102
00:18:10.410 --&gt; 00:18:23.900
Tomer Brisker: And it also gives you information like which Python memory allocator was being used, and the number of frames, that's the number of samples that it was taking. This is also a statistical profiler.</p>
<p>103
00:18:25.123 --&gt; 00:18:33.740
Tomer Brisker: And it also shows us a nice flame chart like we saw before. But, unlike the runtime profilers.</p>
<p>104
00:18:33.760 --&gt; 00:18:59.500
Tomer Brisker: this flame chart, the X-axis, is actually the size of the memory allocated, so the wider the block is, that means that the memory allocated within this block was higher. And also we can see specific statistics for a certain method call. For example, in this case we were allocating 82 MB of memory and 2 and a half</p>
<p>105
00:18:59.520 --&gt; 00:19:05.030
Tomer Brisker: thousand objects were being allocated in a single call to subclass check.</p>
<p>106
00:19:09.150 --&gt; 00:19:11.982
Tomer Brisker: Okay, so let's go back to the bug.</p>
<p>107
00:19:13.050 --&gt; 00:19:20.099
Tomer Brisker: ABC subclass check has very poor performance and, I think, a memory leak. It's been open since May 2022.</p>
<p>108
00:19:20.350 --&gt; 00:19:24.630
Tomer Brisker: Anybody in the audience maybe knows who Samuel Colvin is</p>
<p>109
00:19:26.440 --&gt; 00:19:32.050
Tomer Brisker: feel free to unmute. If you do anyone.</p>
<p>110
00:19:32.150 --&gt; 00:19:48.209
Tomer Brisker: Samuel Colvin, that's the guy behind pydantic. Pydantic is a very popular data validation library for Python. He opened this issue almost 3 years ago, and it's still open. So</p>
<p>111
00:19:48.856 --&gt; 00:19:53.720
Tomer Brisker: well, I guess case closed. Python is a slow language.</p>
<p>112
00:19:53.940 --&gt; 00:20:09.769
Tomer Brisker: There's nothing to do about it. We have to rewrite our application completely using Go, Rust, Elixir, I don't know what the cool kids are using today. Gabor, I know you do Rust a lot, so I guess rewrite, right?</p>
<p>113
00:20:10.990 --&gt; 00:20:16.260
Tomer Brisker: Now, I could just decide this is a case of</p>
<p>114
00:20:16.470 --&gt; 00:20:26.119
Tomer Brisker: language limitations, we have to cope with it, and that's it. But I decided to dig in a little bit deeper and try to figure out if there's something we can do to resolve the issue.</p>
<p>115
00:20:26.310 --&gt; 00:20:35.859
Tomer Brisker: And to dig in a little bit deeper, we need to discuss ABC a bit. Not the TV network: abstract base classes,</p>
<p>116
00:20:36.280 --&gt; 00:20:41.090
Tomer Brisker: for those of you who are not familiar with abstract base classes,</p>
<p>117
00:20:41.677 --&gt; 00:20:47.099
Tomer Brisker: Which, as we mentioned, have a fairly poor performance for the subclass check.</p>
<p>118
00:20:47.806 --&gt; 00:20:54.949
Tomer Brisker: Abstract base classes are a mechanism in Python that allows us to define an abstract class</p>
<p>119
00:20:55.260 --&gt; 00:21:14.820
Tomer Brisker: which allows us to define certain methods that we require any class subclassing from that class to implement. So if we try to subclass it and we don't implement these specific methods, the interpreter will yell at us saying, hey,</p>
<p>120
00:21:14.930 --&gt; 00:21:19.040
Tomer Brisker: this class has to implement a certain method.</p>
<p>121
00:21:19.701 --&gt; 00:21:34.009
Tomer Brisker: Usually we subclass ABCs, which makes sense, basically defining a specific interface we want to implement. But they have an interesting feature, which is registering virtual subclasses.</p>
<p>122
00:21:34.662 --&gt; 00:21:38.999
Tomer Brisker: Which is, for example, let's say you have a</p>
<p>123
00:21:39.180 --&gt; 00:21:50.129
Tomer Brisker: base class that you've defined, and you want one of the built-in types of python to be a subclass of that. Obviously, you can't</p>
<p>124
00:21:50.380 --&gt; 00:21:54.520
Tomer Brisker: have int subclassing something else right?</p>
<p>125
00:21:55.065 --&gt; 00:22:09.930
Tomer Brisker: And the ABCMeta metaclass allows us to register various other classes as virtual subclasses of the base class.</p>
<p>126
00:22:09.950 --&gt; 00:22:33.360
Tomer Brisker: This is also useful if you have a class that implements multiple interfaces. Let's say you have a class that implements iterable, and hashable, and, I don't know, sortable, say, and a few others. Obviously, you don't want to have to declare all of these when you create a class.</p>
<p>127
00:22:33.860 --&gt; 00:22:50.000
Tomer Brisker: You can just register this class as a virtual subclass. And that means that if you look at the method resolution order, or the MRO, of a specific object of that class, you won't see these classes as its parent.</p>
<p>128
00:22:50.150 --&gt; 00:23:15.209
Tomer Brisker: That's, by the way, the way subclass check usually works: it checks the method resolution order to see if the parent class is there. But since we allow registering subclasses virtually, to classes that aren't their actual parents, there's a specific implementation within ABCMeta for subclass check and</p>
<p>129
00:23:15.270 --&gt; 00:23:21.079
Tomer Brisker: instance check that allows support for this specific use case.</p>
<p>130
00:23:21.390 --&gt; 00:23:46.080
Tomer Brisker: So just to better understand this use case, let's say we have this class hierarchy: we have a base class, which is an abstract base class; we have class A and class B inheriting from that base class; and we have a virtual subclass which isn't inheriting from the base class but is registered as a subclass of this base class; and so forth, we have virtual subclass A, etc., etc.</p>
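<p>A minimal sketch of the hierarchy being described; the class names loosely follow the slide and are otherwise made up:</p>

```python
from abc import ABC

class Base(ABC):
    pass

class A(Base):           # ordinary inheritance: Base appears in A's MRO
    pass

class VirtualSubclassA:  # no inheritance relationship at all
    pass

# Register VirtualSubclassA as a *virtual* subclass of Base
Base.register(VirtualSubclassA)

print(issubclass(A, Base))                 # True, via the MRO
print(issubclass(VirtualSubclassA, Base))  # True, via ABCMeta's registry
print(Base in VirtualSubclassA.__mro__)    # False: Base is not a real parent
```

<p>The second check is what forces ABCMeta to consult its registry and walk the inheritance tree, which is where the cost discussed next comes from.</p>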
<p>131
00:23:46.867 --&gt; 00:23:54.869
Tomer Brisker: Let's say we have an object, and we want to check if that object is a subclass or an instance of base class.</p>
<p>132
00:23:55.358 --&gt; 00:24:02.720
Tomer Brisker: In this case, let's say this object is of type virtual subclass A. We would need to check</p>
<p>133
00:24:02.910 --&gt; 00:24:04.010
Tomer Brisker: the whole</p>
<p>134
00:24:04.160 --&gt; 00:24:25.339
Tomer Brisker: inheritance tree for base class to identify if this class was registered to any of the classes within that inheritance tree. So this calculation is pretty complex. There's also potentially an issue with a bad implementation of</p>
<p>135
00:24:25.420 --&gt; 00:24:53.739
Tomer Brisker: caching within the implementation of abstract base class. But normally this isn't a big issue, because you wouldn't have that many classes inheriting from a single base class, maybe 2, 3, 10, 20. Usually it's not noticeable. But, as I mentioned, we're dealing with tax reports and tax filing for the US IRS.</p>
<p>136
00:24:53.890 --&gt; 00:25:23.869
Tomer Brisker: There's thousands of different forms that the user needs to fill in. Think of the number of states. Each State has its own forms. Each form is composed of multiple different parts, and you can pretty quickly guess the rough number of classes that we have in our system to enable this fairly complex calculation, which has led us to this issue because of the</p>
<p>137
00:25:23.930 --&gt; 00:25:32.029
Tomer Brisker: very large inheritance that we have from our base class that we use for the calculation.</p>
<p>138
00:25:32.380 --&gt; 00:25:49.000
Tomer Brisker: And that's why, going back to the solution, this solution worked. Let's look at it a bit more in depth. Obviously, we're using type definitions, we are not barbarians; previously I was dropping them just to make it easier to look at.</p>
<p>139
00:25:49.550 --&gt; 00:26:11.719
Tomer Brisker: But this is very, very straightforward. These are is_subclass and is_instance, and what they do is they go to type. type is the base class for all classes, in case you're not familiar with it, and they call __subclasscheck__ or __instancecheck__ on type directly. By default, when you call isinstance or issubclass,</p>
<p>140
00:26:11.990 --&gt; 00:26:39.610
Tomer Brisker: the way it works is, it goes to the class that is on the right-hand side, basically the second parameter of the function call, and it checks the metaclass for that class, looking for __subclasscheck__ or __instancecheck__ depending on which method you called, and then it goes up the method resolution order until it finds the implementation. In the case of</p>
<p>141
00:26:39.700 --&gt; 00:27:02.590
Tomer Brisker: an abstract base class, it would go to the ABC __subclasscheck__. But here what we do is we basically bypass the ABC methods and go directly to the source, type.__subclasscheck__, which is the default implementation used by Python for any types that aren't abstract base classes.</p>
<p>142
00:27:03.150 --&gt; 00:27:15.699
Tomer Brisker: And then all we had to do in our code base, a very simple change, is use our is_subclass helper instead of using the default Python implementation for</p>
<p>143
00:27:15.800 --&gt; 00:27:35.569
Tomer Brisker: checking. If it's a subclass of the base model. And the reason this worked is because we didn't really care about the virtual subclass aspect of ABC. In our case we were just checking. If a certain object is or isn't, a subclass of our base model.</p>
<p>144
00:27:35.670 --&gt; 00:27:44.720
Tomer Brisker: This wouldn't work, obviously, if we were actually registering our objects into the base model instead of directly inheriting from it.</p>
<p>145
00:27:44.850 --&gt; 00:27:47.290
Tomer Brisker: But in our case this was good enough.</p>
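The bypass Tomer describes can be sketched like this. The helper names (`fast_issubclass`, `fast_isinstance`) and the classes are illustrative stand-ins, not the actual helpers from his codebase; the `type.__subclasscheck__` / `type.__instancecheck__` calls are the real CPython mechanism being discussed:

```python
from abc import ABC


class Base(ABC):
    """An abstract base class with, in the real system, a huge inheritance tree."""


class Child(Base):
    pass


class Virtual:
    pass


Base.register(Virtual)  # Virtual is now a *virtual* subclass of Base


def fast_issubclass(cls, base):
    # Bypass ABCMeta.__subclasscheck__ (which walks the registry and the
    # inheritance tree) and call type's default check, which only looks
    # at the real MRO.
    return type.__subclasscheck__(base, cls)


def fast_isinstance(obj, base):
    return type.__instancecheck__(base, obj)


print(fast_issubclass(Child, Base))    # True: real inheritance is seen
print(fast_issubclass(Virtual, Base))  # False: virtual registration is ignored
print(issubclass(Virtual, Base))       # True: the ABC machinery does see it
```

As the talk notes, this only works because nothing relied on virtual registration: the fast check deliberately ignores anything added via `register()`.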
<p>146
00:27:47.520 --&gt; 00:27:57.079
Tomer Brisker: and, as you can see, there's another very nice added benefit to profiling. It lets you add nice statistics to your pull requests.</p>
<p>147
00:27:57.812 --&gt; 00:28:08.119
Tomer Brisker: For example, runtime went down from 33 seconds to 26 seconds, memory usage was improved by 50%, etc.</p>
<p>148
00:28:08.120 --&gt; 00:28:31.359
Tomer Brisker: Actually, I didn't even have to implement this in all of the places. I only had to switch to using the fast subclass check in very specific places I identified, using profiling, as being the most common places this was being called from, and this already gave me a very significant improvement.</p>
<p>149
00:28:31.360 --&gt; 00:28:56.700
Tomer Brisker: And when we actually deployed this fix, it turns out that the impact was even higher than in the specific use case that I was profiling, because this had impact across the system. As you can see here, it significantly reduced our memory load, it also improved the system runtime in general, and the system load time was drastically reduced,</p>
<p>150
00:28:56.960 --&gt; 00:29:08.460
Tomer Brisker: making our deployments much faster and saving a lot of costs. Questions, anybody?</p>
<p>151
00:29:11.030 --&gt; 00:29:16.439
Gabor Szabo: First of all, thanks for the presentation. Can you go back one slide?</p>
<p>152
00:29:16.880 --&gt; 00:29:17.540
Tomer Brisker: Yes.</p>
<p>153
00:29:17.680 --&gt; 00:29:20.002
Gabor Szabo: What is this bump here?</p>
<p>154
00:29:20.848 --&gt; 00:29:37.821
Tomer Brisker: That's a good question, actually. We're using rolling updates. So basically, we spun up a few pods, switched over to them, then spun up a few more pods, and these are,</p>
<p>155
00:29:39.060 --&gt; 00:29:56.730
Tomer Brisker: this is the time when there were still some of the older pods running in parallel with the new pods that were using less memory. So the first drop is when we killed the first batch of the old pods, and the second drop is when we killed the second batch.</p>
<p>156
00:29:57.290 --&gt; 00:29:57.870
Gabor Szabo: Hmm!</p>
<p>157
00:30:00.030 --&gt; 00:30:05.289
Gabor Szabo: But why did this happen? I still don't understand why it went up, then went up again.</p>
<p>158
00:30:05.554 --&gt; 00:30:10.840
Tomer Brisker: So we started a few pods, killed a few, and then started a bunch more and then killed the rest.</p>
<p>159
00:30:11.010 --&gt; 00:30:11.490
Tomer Brisker: Oh.</p>
<p>160
00:30:11.490 --&gt; 00:30:12.060
Gabor Szabo: Okay.</p>
<p>161
00:30:12.280 --&gt; 00:30:18.859
Tomer Brisker: Oh, this is just the loading of the new pods while the old ones were still running.</p>
<p>162
00:30:19.650 --&gt; 00:30:20.610
Gabor Szabo: Okay. Nice.</p>
<p>163
00:30:24.700 --&gt; 00:30:26.580
Tomer Brisker: Any other questions. Anybody?</p>
<p>164
00:30:29.900 --&gt; 00:30:31.789
Tomer Brisker: Okay, thank you very much.</p>
<p>165
00:30:32.380 --&gt; 00:30:39.630
Gabor Szabo: No, it seems there are none. So thank you for giving this presentation, and everyone for being here listening.</p>
<p>166
00:30:40.250 --&gt; 00:30:41.120
Gabor Szabo: And</p>
<p>167
00:30:42.190 --&gt; 00:30:49.600
Gabor Szabo: I'm going to stop the video. But please remember to like the video and follow the channel, and see you next time, Tomer.</p>
<p>168
00:30:50.180 --&gt; 00:30:52.370
Tomer Brisker: Bye, bye, thanks for having me, Gabor.</p>
<p>169
00:30:52.370 --&gt; 00:30:53.150
Gabor Szabo: Bye-bye.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Simulations for the Mathematically Challenged with Miki Tebeka</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-02-21T07:30:01Z</updated>
    <pubDate>2025-02-21T07:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/simulations-for-the-mathematically-challenged" />
    <id>https://python.code-maven.com/simulations-for-the-mathematically-challenged</id>
    <content type="html"><![CDATA[<p>Question: What are the odds that in a class of 23 students, two have the same birthday?
We'll solve this question and others, using only a <code>for</code> loop and a random number generator.</p>
<p>In this talk, we'll see how to use Monte Carlo simulations to solve various problems that might intimidate you due to a lack of math skills.</p>
<p><a href="https://github.com/tebeka/talks/tree/master/pyweb-sim">source code</a></p>
<p><img src="images/miki-tebeka.png" alt="" /></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CzcxJjpUEmc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.020 --&gt; 00:00:20.500
Gabor Szabo: So hi, and welcome to the Code Maven Meetup Group and to the Code Maven YouTube channel, in case you are watching this on YouTube. My name is Gabor. I provide training services in Python and Rust and help companies get started using these languages.</p>
<p>2
00:00:20.740 --&gt; 00:00:28.840
Gabor Szabo: And I also think that it's important to share knowledge among people. So that's why I'm organizing these events, these meetings.</p>
<p>3
00:00:29.820 --&gt; 00:00:40.000
Gabor Szabo: So I would like to welcome everyone who joined us at this meeting, and especially Mickey, for agreeing to give this presentation, and that's it. The floor is yours, Mickey.</p>
<p>4
00:00:40.694 --&gt; 00:00:44.060
Miki Tebeka: Hi, everyone. I am going to share my screen.</p>
<p>5
00:00:44.390 --&gt; 00:00:48.980
Miki Tebeka: Then we will start sure.</p>
<p>6
00:00:50.920 --&gt; 00:00:51.690
Miki Tebeka: Okay.</p>
<p>7
00:00:52.080 --&gt; 00:01:03.720
Miki Tebeka: so we are going to talk about what I call simulations for the mathematically challenged. And this is about the tool called simulation. How you can solve various problems.</p>
<p>8
00:01:03.720 --&gt; 00:01:06.970
Gabor Szabo: Sorry. Just just one thing. Can can you move the.</p>
<p>9
00:01:07.230 --&gt; 00:01:07.910
Miki Tebeka: Oh, there!</p>
<p>10
00:01:07.910 --&gt; 00:01:09.300
Gabor Szabo: This again, I'll do this.</p>
<p>11
00:01:10.010 --&gt; 00:01:11.290
Gabor Szabo: Yeah, thanks.</p>
<p>12
00:01:11.290 --&gt; 00:01:17.899
Miki Tebeka: Moved. Okay, sorry. Okay. So my name is Mickey. I've been a professional developer</p>
<p>13
00:01:18.210 --&gt; 00:01:25.368
Miki Tebeka: for 37 years now, give or take. I work mostly with Python and Go.</p>
<p>14
00:01:26.200 --&gt; 00:01:31.599
Miki Tebeka: I teach. I consult, I write books, I do videos. I</p>
<p>15
00:01:31.870 --&gt; 00:01:46.060
Miki Tebeka: enjoy myself in a very geeky way. And this is a tool that I've used on a couple of occasions. I think it's very simple, but not a lot of people are aware of it,</p>
<p>16
00:01:46.580 --&gt; 00:01:51.379
Miki Tebeka: and it starts usually with a problem that you have, usually a data-related problem.</p>
<p>17
00:01:53.670 --&gt; 00:01:58.850
Miki Tebeka: You have a cache. You want to know what the odds are that, given that</p>
<p>18
00:01:59.080 --&gt; 00:02:10.100
Miki Tebeka: amount of cache hits, what's the average latency? Other questions that usually involve statistics or probability.</p>
<p>19
00:02:10.680 --&gt; 00:02:16.470
Miki Tebeka: And then you say, okay, you know, we're all geeks, right? So we go and hit the books.</p>
<p>20
00:02:17.910 --&gt; 00:02:23.169
Miki Tebeka: But then you start seeing all these kinds of equations.</p>
<p>21
00:02:23.290 --&gt; 00:02:28.480
Miki Tebeka: and usually around that time I say, you know what, maybe this problem is not that important,</p>
<p>22
00:02:28.670 --&gt; 00:02:30.920
Miki Tebeka: and I'll move on to do something else.</p>
<p>23
00:02:31.680 --&gt; 00:02:37.979
Miki Tebeka: And what I wanted to show you is basically that if you can write a for loop,</p>
<p>24
00:02:38.110 --&gt; 00:02:39.529
Miki Tebeka: you can do statistics.</p>
<p>25
00:02:40.040 --&gt; 00:02:46.170
Miki Tebeka: and that's it. You don't need more than that. You need a for loop, and you need random,</p>
<p>26
00:02:46.530 --&gt; 00:02:55.529
Miki Tebeka: and these are the only 2 tools that you need in order to work. By the way, I'm going to show code, and if you have questions, feel free to ask</p>
<p>27
00:02:58.270 --&gt; 00:03:12.519
Miki Tebeka: if you don't understand the code, or if you want to learn about other things. So what we're going to do is talk about these 5 problems. We're going to talk about the game of Catan and what the best tiles are, we're going to calculate pi</p>
<p>28
00:03:12.800 --&gt; 00:03:17.480
Miki Tebeka: randomly, which sounds weird, but it's another</p>
<p>29
00:03:18.120 --&gt; 00:03:23.610
Miki Tebeka: interesting uses for simulations. We're going to solve the birthday problem.</p>
<p>30
00:03:23.830 --&gt; 00:03:31.290
Miki Tebeka: Given 23 people, I think, what are the odds that 2 people in this group have</p>
<p>31
00:03:32.900 --&gt; 00:03:55.700
Miki Tebeka: the same birthday? We're going to see what the odds are that a person is or isn't sick, given a test that says that he is sick. And we're going to talk about the Monty Hall problem, which is statistically interesting, but also a very philosophical question. I'm not a philosopher, so you can discuss it later and see what's going on.</p>
<p>32
00:03:56.150 --&gt; 00:04:02.910
Miki Tebeka: So let's start with Catan. Right? So in Catan we have these tiles, and every tile has a number on it.</p>
<p>33
00:04:03.290 --&gt; 00:04:06.159
Miki Tebeka: and then you throw a couple of dice,</p>
<p>34
00:04:06.370 --&gt; 00:04:13.100
Miki Tebeka: and if the number on the dice matches the number of your tile, then you can do things in the game.</p>
<p>35
00:04:13.290 --&gt; 00:04:20.110
Miki Tebeka: So at the beginning you can pick where you want to put your pieces, and it's up to you to decide</p>
<p>36
00:04:20.269 --&gt; 00:04:26.870
Miki Tebeka: which tile you want. And you want to know, you know, which tiles are going to get the most hits, what the probability</p>
<p>37
00:04:27.060 --&gt; 00:04:28.349
Miki Tebeka: of of doing that?</p>
<p>38
00:04:29.890 --&gt; 00:04:30.850
Miki Tebeka: So</p>
<p>39
00:04:33.760 --&gt; 00:04:39.109
Miki Tebeka: This is this is that. Yes, and I'm old. I'm using vim.</p>
<p>40
00:04:39.440 --&gt; 00:04:54.299
Miki Tebeka: Sue me later. But I think the code is clear enough. So basically, what we're going to do is a dice roll, which is basically just a random number between one and 6. This is coming from there.</p>
<p>41
00:04:54.510 --&gt; 00:05:03.489
Miki Tebeka: And then we run a simulation. So we're going to run a lot of dice rolls. I'm going to do a million</p>
<p>42
00:05:05.690 --&gt; 00:05:07.109
Miki Tebeka: rounds,</p>
<p>43
00:05:08.070 --&gt; 00:05:21.110
Miki Tebeka: and every time I'm going to do 2 dice rolls, right? So I get a number, and I'm updating some kind of counter. We have Counter from collections. This is a special data structure that basically stores</p>
<p>44
00:05:21.520 --&gt; 00:05:29.950
Miki Tebeka: how many times we have seen each value.</p>
<p>45
00:05:30.090 --&gt; 00:05:35.210
Miki Tebeka: And then I'm going over all the numbers right. The minimal</p>
<p>46
00:05:35.680 --&gt; 00:05:48.199
Miki Tebeka: number that you can get with rolling 2 dices is 2, 2 ones and maximum. One is 12, but the range is half open. So we're not going to get there. I'm going to show the fraction</p>
<p>47
00:05:49.340 --&gt; 00:05:53.260
Miki Tebeka: that is, how many times we got each number</p>
<p>48
00:05:54.170 --&gt; 00:05:57.799
Miki Tebeka: out of the total counts, and I'm going to print it out.</p>
<p>49
00:05:57.920 --&gt; 00:06:15.039
Miki Tebeka: That's the code. And then, if you run it with Python, you're going to see we get probabilities, right? And you see that, unsurprisingly, 7</p>
<p>50
00:06:15.290 --&gt; 00:06:34.940
Miki Tebeka: has the best percentage for a roll of 2 dice. And you can do it what I call the hard way, which is just, you know, enumerating all the combinations of dice rolls and then calculating how many there are. But for me as a programmer, this is much easier.</p>
<p>51
00:06:35.610 --&gt; 00:06:39.139
Miki Tebeka: We just write some code. This is like 20 lines of code,</p>
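The dice simulation just described can be sketched like this (variable names are illustrative, not the exact code shown on screen):

```python
import random
from collections import Counter

n = 1_000_000
# Roll two dice n times and count how often each total comes up.
counts = Counter(
    random.randint(1, 6) + random.randint(1, 6)
    for _ in range(n)
)

# The minimum total is 2 and the maximum is 12; range() is half open.
for total in range(2, 13):
    print(f"{total:2d}: {counts[total] / n:.2%}")
```

With a million rolls, 7 comes out on top at roughly 16.7%, matching the combinatorial answer (6 of 36 combinations).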
<p>52
00:06:39.370 --&gt; 00:06:45.859
Miki Tebeka: pretty simple, and now I have it. So this is the basics of simulation. We basically</p>
<p>53
00:06:47.220 --&gt; 00:06:54.540
Miki Tebeka: create scenarios using some kind of randomness in each scenario, and then we are going to</p>
<p>54
00:06:55.030 --&gt; 00:07:04.519
Miki Tebeka: calculate some statistics about what happened in every scenario, and finally display the result. And this is known as a simulation, or Monte Carlo simulation,</p>
<p>55
00:07:04.710 --&gt; 00:07:05.669
Miki Tebeka: for what we</p>
<p>56
00:07:08.870 --&gt; 00:07:10.259
Miki Tebeka: Any questions about this one?</p>
<p>57
00:07:16.070 --&gt; 00:07:17.270
Miki Tebeka: no questions.</p>
<p>58
00:07:17.570 --&gt; 00:07:21.660
Miki Tebeka: Alright. By the way, if you have questions, just open the mic and ask, because</p>
<p>59
00:07:22.140 --&gt; 00:07:26.409
Miki Tebeka: it's hard for me to focus both on the code and on the Zoom screen.</p>
<p>60
00:07:26.960 --&gt; 00:07:34.329
Miki Tebeka: Okay, the next thing we're going to do is pretty interesting. We're going to calculate pi, again randomly.</p>
<p>61
00:07:34.830 --&gt; 00:07:35.890
Miki Tebeka: So</p>
<p>62
00:07:36.820 --&gt; 00:07:42.999
Miki Tebeka: what the way we're going to do it is, we're going to say, let's take a circle which has a radius of one.</p>
<p>63
00:07:43.810 --&gt; 00:07:48.237
Miki Tebeka: And now we're going to concentrate only on the top right</p>
<p>64
00:07:48.990 --&gt; 00:07:57.920
Miki Tebeka: square, which is the bounding square for this circle, and we're going to start placing random dots.</p>
<p>65
00:07:58.290 --&gt; 00:08:02.520
Miki Tebeka: If the dot falls in the circle, I'm going to paint it</p>
<p>66
00:08:03.110 --&gt; 00:08:08.290
Miki Tebeka: green, and if it falls outside of the circle, I'm going to paint it red.</p>
<p>67
00:08:08.860 --&gt; 00:08:16.689
Miki Tebeka: Okay? So once I've done it enough times I can calculate what is the ratio between the green dots and the red dots.</p>
<p>68
00:08:17.180 --&gt; 00:08:21.580
Miki Tebeka: and this ratio is a quarter of pi,</p>
<p>69
00:08:24.010 --&gt; 00:08:28.210
Miki Tebeka: right? Because the area of</p>
<p>70
00:08:30.090 --&gt; 00:08:32.335
Miki Tebeka: the circle is</p>
<p>71
00:08:33.780 --&gt; 00:08:37.330
Miki Tebeka: pi r squared. But r is one. So it's just pi.</p>
<p>72
00:08:37.840 --&gt; 00:08:42.050
Miki Tebeka: so basically, the amount of dots that fall inside the circle should give us pi,</p>
<p>73
00:08:42.320 --&gt; 00:08:50.420
Miki Tebeka: but we're doing it only on a quarter of the circle, so this is a quarter of pi, and we are going to get the number pi.</p>
<p>74
00:08:50.800 --&gt; 00:08:56.720
Miki Tebeka: So this is pi.py.</p>
<p>75
00:08:57.840 --&gt; 00:09:13.699
Miki Tebeka: So again, we're going to import, this time, uniform from random, and then sqrt. And this is going to run for a bit, so I'm going to display a progress bar with something called tqdm,</p>
<p>76
00:09:15.030 --&gt; 00:09:22.700
Miki Tebeka: and then the radius is one, and we have n, which is the number of iterations, which is 100 million,</p>
<p>77
00:09:23.370 --&gt; 00:09:35.250
Miki Tebeka: and inner is the number of points that are inside the circle, which starts at 0. So I'm getting x and y, which are uniform between 0 and one.</p>
<p>78
00:09:35.960 --&gt; 00:09:42.390
Miki Tebeka: And then, if the point falls inside the circle, I'm just going to increment inner.</p>
<p>79
00:09:43.700 --&gt; 00:09:46.760
Miki Tebeka: So this is how many points fell inside the circle.</p>
<p>80
00:09:48.670 --&gt; 00:10:01.319
Miki Tebeka: Now the ratio is inner divided by n, and as we said, this is a quarter of pi, so we need to print out 4 times this ratio to get the number pi.</p>
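A minimal sketch of the pi estimator just described, without the tqdm progress bar and with fewer iterations than the 100 million used in the talk:

```python
from random import uniform

n = 1_000_000  # the talk uses 100 million; fewer keeps this quick
inner = 0
for _ in range(n):
    x, y = uniform(0, 1), uniform(0, 1)
    if x * x + y * y <= 1:  # inside the quarter circle of radius 1
        inner += 1

# inner / n approximates a quarter of pi, so multiply by 4.
print(4 * inner / n)
```

With a million points the estimate typically lands around 3.14, within a few thousandths of pi.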
<p>81
00:10:07.660 --&gt; 00:10:14.720
Miki Tebeka: And you know what, I'm going to also run the time command to show you how much time it took. So this is 100 million</p>
<p>82
00:10:15.110 --&gt; 00:10:19.840
Miki Tebeka: runs, so it's going to take a little bit of time.</p>
<p>83
00:10:20.300 --&gt; 00:10:21.420
Miki Tebeka: And</p>
<p>84
00:10:23.110 --&gt; 00:10:30.290
Miki Tebeka: and it's a good thing in the winter, because it also warms up your CPU, so you can warm yourself without using the A/C.</p>
<p>85
00:10:35.020 --&gt; 00:10:43.789
Miki Tebeka: As I said, tqdm, which shows the progress bar, is really nice, especially if you have long-running processes, so you know if your process is actually running or if it's stuck.</p>
<p>86
00:10:43.930 --&gt; 00:10:45.236
Miki Tebeka: So we're</p>
<p>87
00:10:46.570 --&gt; 00:10:56.980
Miki Tebeka: So it's been done, and we see that we got 3.14...</p>
<p>88
00:10:57.380 --&gt; 00:11:04.729
Miki Tebeka: Yeah, which is close enough to pi. And it took us about 41 seconds to run.</p>
<p>89
00:11:06.320 --&gt; 00:11:12.810
Miki Tebeka: Now, one thing that can help you with simulations is PyPy.</p>
<p>90
00:11:12.980 --&gt; 00:11:24.209
Miki Tebeka: If you're not familiar, the Python we are using is called CPython. It's Python written in C. There are other Pythons, such as Jython, which is Python written in Java,</p>
<p>91
00:11:24.530 --&gt; 00:11:30.700
Miki Tebeka: and several others, and MicroPython for micro devices, and there is PyPy,</p>
<p>92
00:11:30.970 --&gt; 00:11:39.090
Miki Tebeka: which is a Python written in Python, and it has several optimizations that are not in CPython,</p>
<p>93
00:11:39.330 --&gt; 00:11:48.439
Miki Tebeka: especially a JIT compiler, though from 3.13 and up we have an experimental JIT compiler in CPython, which should bring it closer,</p>
<p>94
00:11:49.475 --&gt; 00:11:54.449
Miki Tebeka: and if I'm going to run PyPy on that thing,</p>
<p>95
00:11:54.640 --&gt; 00:11:56.969
Miki Tebeka: you're going to see the difference.</p>
<p>96
00:11:57.100 --&gt; 00:12:05.319
Miki Tebeka: Right, I forgot the time command. As you can see, this is</p>
<p>97
00:12:05.520 --&gt; 00:12:12.520
Miki Tebeka: 3.5 seconds, so more than 10 times faster on these calculations. Now, I'm not saying</p>
<p>98
00:12:13.420 --&gt; 00:12:22.700
Miki Tebeka: use PyPy for everything. There are some compatibility issues, especially with external libraries and maybe other things, and it's not</p>
<p>99
00:12:22.890 --&gt; 00:12:30.900
Miki Tebeka: on par with CPython. Currently, I think they're on the</p>
<p>100
00:12:31.160 --&gt; 00:12:36.990
Miki Tebeka: equivalent of 3.10, and right now in CPython we are on 3.13. So it takes them some time:</p>
<p>101
00:12:37.400 --&gt; 00:12:40.189
Miki Tebeka: they're catching up, and then they close the gap.</p>
<p>102
00:12:40.690 --&gt; 00:12:48.000
Miki Tebeka: But it's a nice tool to know and work with. Questions about this one?</p>
<p>103
00:12:56.360 --&gt; 00:12:59.850
Gabor Szabo: It's not about this one, and probably it's not</p>
<p>104
00:13:00.490 --&gt; 00:13:13.559
Gabor Szabo: relevant to this presentation, but maybe it's for another time: how come, actually, PyPy can be so much faster? I would really like to understand this.</p>
<p>105
00:13:13.560 --&gt; 00:13:29.079
Miki Tebeka: The current CPython itself does not do any optimization. If you look at a C compiler, it has tons of optimizations: loop unrolling, constant folding, a lot of things that the Python interpreter is not doing at all.</p>
<p>106
00:13:29.784 --&gt; 00:13:40.270
Miki Tebeka: And the other thing is that there is a technology called JIT, which is just-in-time compilation, which means that you run the code once in Python,</p>
<p>107
00:13:40.430 --&gt; 00:13:45.080
Miki Tebeka: you see what happens there and then you generate specific machine code.</p>
<p>108
00:13:45.230 --&gt; 00:13:52.810
Miki Tebeka: and the next time you call the function, it is actually not the Python function that's called, but the</p>
<p>109
00:13:53.160 --&gt; 00:14:03.310
Miki Tebeka: optimized generated machine code for that. And this is something that Nodejs and other dynamic languages are using, including Java.</p>
<p>110
00:14:03.590 --&gt; 00:14:08.849
Miki Tebeka: to make things faster. And PyPy has a</p>
<p>111
00:14:09.040 --&gt; 00:14:12.810
Miki Tebeka: very good JIT compiler that has been developed for a lot of years,</p>
<p>112
00:14:13.449 --&gt; 00:14:21.089
Miki Tebeka: and that's why it's faster. Basically, PyPy is written in Python, but eventually generates</p>
<p>113
00:14:21.520 --&gt; 00:14:28.779
Miki Tebeka: executable machine code. So it is pretty fast in this case.</p>
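To see the interpreter difference yourself, a tiny loop-heavy harness like the following can be run under both CPython and PyPy (having `pypy3` installed is an assumption about your setup; the snippet itself mirrors the pi simulation's inner loop):

```python
import timeit

# A loop-heavy snippet similar to the pi simulation's inner loop.
# Run this same file with `python3` and with `pypy3` and compare the times.
stmt = "sum(uniform(0, 1) ** 2 + uniform(0, 1) ** 2 <= 1 for _ in range(100_000))"
elapsed = timeit.timeit(stmt, setup="from random import uniform", number=10)
print(f"{elapsed:.3f} seconds for 10 runs")
```

On pure-Python numeric loops like this, PyPy's JIT typically shows a large speedup, consistent with the roughly 10x difference reported in the talk.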
<p>114
00:14:32.600 --&gt; 00:14:34.444
Miki Tebeka: Okay, so</p>
<p>115
00:14:35.730 --&gt; 00:14:43.390
Miki Tebeka: someone joked that, you know, every time they go on a Zoom meeting, that's what comes to their mind, right? This is very similar</p>
<p>116
00:14:43.630 --&gt; 00:14:55.150
Miki Tebeka: to Zoom, and the idea is, and this is known as the birthday problem. Given a group of people.</p>
<p>117
00:14:55.500 --&gt; 00:15:00.259
Miki Tebeka: what are the odds that 2 people have the same birthday?</p>
<p>118
00:15:00.800 --&gt; 00:15:01.740
Miki Tebeka: And</p>
<p>119
00:15:08.330 --&gt; 00:15:10.020
Miki Tebeka: if a</p>
<p>120
00:15:11.310 --&gt; 00:15:19.429
Miki Tebeka: What I picked as a group size is 23 people. So I would like you to just take a guess:</p>
<p>121
00:15:19.730 --&gt; 00:15:24.659
Miki Tebeka: like we have a group of 23 people. What are the odds that 2 people have the same birthday?</p>
<p>122
00:15:24.930 --&gt; 00:15:35.990
Miki Tebeka: Yeah. So, a random birthday. I'm going to basically say that I'm not looking at dates, I'm looking at day of year. So we have 365 days per year, so</p>
<p>123
00:15:36.480 --&gt; 00:15:41.359
Miki Tebeka: a random date is basically a number between one and 365. That's</p>
<p>124
00:15:41.590 --&gt; 00:15:50.350
Miki Tebeka: how many days we have in a year. And now what I'm saying is, given a group of a given size, are there any duplicates in the group?</p>
<p>125
00:15:50.770 --&gt; 00:16:00.790
Miki Tebeka: Alright? So basically I'm creating a set, then going over the numbers, generating a random birthday. And then, if we've already seen this birthday,</p>
<p>126
00:16:01.290 --&gt; 00:16:06.209
Miki Tebeka: we say there is a duplication, so at least 2 people have the same birthday in that group.</p>
<p>127
00:16:06.350 --&gt; 00:16:16.929
Miki Tebeka: otherwise we add it to the set and continue. And finally, if we're out of the for loop, we say no, there are no duplicates in this group. So basically, we draw a group of</p>
<p>128
00:16:17.140 --&gt; 00:16:18.390
Miki Tebeka: random numbers.</p>
<p>129
00:16:19.026 --&gt; 00:16:24.570
Miki Tebeka: Between one and 365, and say, is there an overlap here somewhere?</p>
<p>130
00:16:26.440 --&gt; 00:16:38.710
Miki Tebeka: Alright? And now we start the simulation again. So the simulation now is going to run a million times. The group size is 23. And again, the number of duplications is</p>
<p>131
00:16:38.900 --&gt; 00:16:48.529
Miki Tebeka: 0 to begin with. And then we run the simulation. And if there is a duplication, we say, Okay, let's increment duplication.</p>
<p>132
00:16:48.640 --&gt; 00:16:54.299
Miki Tebeka: Finally, we are printing what the fraction is from the total.</p>
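The birthday simulation just walked through can be sketched like this (function and variable names are illustrative):

```python
from random import randint

def has_duplicate(size):
    # Draw `size` random days of the year; report whether any day repeats.
    seen = set()
    for _ in range(size):
        day = randint(1, 365)
        if day in seen:
            return True
        seen.add(day)
    return False

n = 100_000
group_size = 23
duplicates = sum(has_duplicate(group_size) for _ in range(n))
print(duplicates / n)  # comes out close to 0.5
```

The exact analytical answer for 23 people is about 50.7%, and the simulation lands within a fraction of a percent of that.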
<p>133
00:16:55.750 --&gt; 00:16:58.950
Miki Tebeka: Okay, so remember the number you guessed.</p>
<p>134
00:17:09.520 --&gt; 00:17:12.040
Miki Tebeka: anyone guessed 50%.</p>
<p>135
00:17:15.579 --&gt; 00:17:18.080
Miki Tebeka: Right? This seems pretty high, right?</p>
<p>136
00:17:18.319 --&gt; 00:17:24.920
Miki Tebeka: And this is something that happens with statistics and probabilities a lot of time. This is not intuitive.</p>
<p>137
00:17:25.069 --&gt; 00:17:33.850
Miki Tebeka: A lot of times we think that we know the answer, and we say, you know, there are a lot of days and only 23 people, how come? But</p>
<p>138
00:17:34.070 --&gt; 00:17:41.780
Miki Tebeka: if you do it and you do the statistical computation, you'll get exactly the same thing. That's the idea. But</p>
<p>139
00:17:42.430 --&gt; 00:17:48.739
Miki Tebeka: again, this is a more complicated</p>
<p>140
00:17:49.250 --&gt; 00:17:53.420
Miki Tebeka: computation. And for me, as a developer. This is pretty easy.</p>
<p>141
00:17:53.760 --&gt; 00:17:55.630
Miki Tebeka: you know, a for loop and random,</p>
<p>142
00:17:56.360 --&gt; 00:17:57.329
Miki Tebeka: and I'm done.</p>
<p>143
00:18:01.430 --&gt; 00:18:10.880
Gabor Szabo: Yeah, it's nice. What would be interesting, I think, is to see if you take 2 people, 3 people and so on. Up to 365 people.</p>
<p>144
00:18:11.340 --&gt; 00:18:15.830
Gabor Szabo: and for each number to see this, the probability and then graph.</p>
<p>145
00:18:16.185 --&gt; 00:18:27.929
Miki Tebeka: Pretty sure there is. If you go on the web. This is a really known problem. People have done it already. There's a. You can probably see the graph for that.</p>
<p>146
00:18:29.194 --&gt; 00:18:37.129
Miki Tebeka: But this is not exactly true. Right? This is a joke right?</p>
<p>147
00:18:37.280 --&gt; 00:18:46.709
Miki Tebeka: And the chances of a piece of bread falling butter-side down are directly proportional to the cost of the carpet; it's not 50-50.</p>
<p>148
00:18:47.040 --&gt; 00:18:49.360
Miki Tebeka: This is according to Mr. Murphy.</p>
<p>149
00:18:49.580 --&gt; 00:18:50.490
Miki Tebeka: Oh.</p>
<p>150
00:18:50.730 --&gt; 00:19:02.460
Miki Tebeka: and the thing is that births are not evenly distributed across the days of the year, especially if you're born on March 29th,</p>
<p>151
00:19:03.080 --&gt; 00:19:09.370
Miki Tebeka: right? This is reducing the odds that you're going to have someone with the same birthday.</p>
<p>152
00:19:12.510 --&gt; 00:19:13.389
Miki Tebeka: So</p>
<p>153
00:19:17.110 --&gt; 00:19:32.149
Miki Tebeka: we have a model, and there's a saying by George Box which I really believe: all models are wrong, but some are useful. Right? So the model needs to be simple enough, and the answer should be interesting</p>
<p>154
00:19:34.740 --&gt; 00:19:35.405
Miki Tebeka: and</p>
<p>155
00:19:38.780 --&gt; 00:19:45.018
Miki Tebeka: accurate enough to give you a useful answer. But you can have a better model. So I actually took some data</p>
<p>156
00:19:45.590 --&gt; 00:19:48.589
Miki Tebeka: about birthdays in general.</p>
<p>157
00:19:49.030 --&gt; 00:19:53.730
Miki Tebeka: Right? So no, this is</p>
<p>158
00:19:55.770 --&gt; 00:20:06.170
Miki Tebeka: US births, right? So the CSV file gives you the births per day. Oh, these are Windows newlines, yay.</p>
<p>159
00:20:07.250 --&gt; 00:20:08.130
Miki Tebeka: so</p>
<p>160
00:20:09.380 --&gt; 00:20:14.699
Miki Tebeka: We have year, month, day of birth, day of the week, and how many births there were</p>
<p>161
00:20:15.070 --&gt; 00:20:17.320
Miki Tebeka: per every one of them.</p>
<p>162
00:20:17.941 --&gt; 00:20:24.589
Miki Tebeka: And then what I'm going to do is I'm going to do actually weighted probabilities.</p>
<p>163
00:20:24.910 --&gt; 00:20:28.047
Miki Tebeka: meaning it's not that every day has the</p>
<p>164
00:20:29.080 --&gt; 00:20:41.680
Miki Tebeka: the same probability. But I'm going to use these frequencies to do that. And here I'm going to switch over to tools from the scientific python side of things.</p>
<p>165
00:20:42.410 --&gt; 00:20:45.390
Miki Tebeka: And these tools are pandas and numpy.</p>
<p>166
00:20:46.170 --&gt; 00:21:04.579
Miki Tebeka: And I'm using pandas. If you're not familiar with pandas, this may be a topic for a different talk, or probably one has already been done before. This is a really, really great library for working with data. I'm using it to load the CSV,</p>
<p>167
00:21:06.360 --&gt; 00:21:15.900
Miki Tebeka: so basically, I'm loading the birthdays from this CSV file, and I'm</p>
<p>168
00:21:18.510 --&gt; 00:21:20.829
Miki Tebeka: converting things to datetime.</p>
<p>169
00:21:20.980 --&gt; 00:21:25.829
Miki Tebeka: And then I'm saying that the the birthday is the day and the month.</p>
<p>170
00:21:27.370 --&gt; 00:21:36.752
Miki Tebeka: and then I'm doing what is known as a group-by to get all of the people that were born on the same day and month,</p>
<p>171
00:21:38.260 --&gt; 00:21:45.690
Miki Tebeka: divided by the total number of births to get probabilities, and then returning the index and the values.</p>
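The group-by step can be sketched like this. The actual CSV isn't shown in the feed, so the tiny DataFrame below is a hypothetical stand-in with the columns the talk mentions:

```python
import pandas as pd

# Hypothetical stand-in for the births CSV used in the talk:
# the real file has year, month, day of month, day of week, and a births count.
df = pd.DataFrame({
    "month":  [1, 1, 2, 9],
    "day":    [1, 1, 14, 9],
    "births": [100, 120, 90, 200],
})

# Group all rows for the same (month, day) and turn counts into weights
# by dividing by the total number of births.
freq = df.groupby(["month", "day"])["births"].sum()
probs = freq / freq.sum()
print(probs)
```

The resulting index (the `(month, day)` pairs) and values (the weights) are exactly what gets fed to the weighted random choice in the next step.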
<p>172
00:21:46.010 --&gt; 00:21:52.929
Miki Tebeka: So once I have that, I can do something else. I'm going to use numpy now.</p>
<p>173
00:21:53.320 --&gt; 00:21:59.770
Miki Tebeka: numpy has a random choice. Basically, choice says: pick things from a group.</p>
<p>174
00:22:00.274 --&gt; 00:22:10.019
Miki Tebeka: If you don't say anything, it's going to give an equal probability to everything. But you can provide the size and the probabilities,</p>
<p>175
00:22:10.230 --&gt; 00:22:18.226
Miki Tebeka: and then it is going to do a weighted probability, meaning there's a bigger chance if more people are born on,</p>
<p>176
00:22:19.980 --&gt; 00:22:24.335
Miki Tebeka: What is that? September 9, let's say then,</p>
<p>177
00:22:24.950 --&gt; 00:22:30.960
Miki Tebeka: there is a bigger chance it's going to pick September 9 versus February 29th,</p>
<p>178
00:22:31.430 --&gt; 00:22:36.951
Miki Tebeka: let's say, or another day, April 26. For some reason, I don't know why,</p>
<p>179
00:22:37.570 --&gt; 00:22:42.380
Miki Tebeka: people don't like their birthday, they think, or July 4th.</p>
<p>180
00:22:43.140 --&gt; 00:22:46.209
Miki Tebeka: I don't know why those have less.</p>
<p>181
00:22:46.480 --&gt; 00:22:53.181
Miki Tebeka: Okay, so I'm loading the birthdays from the CSV file. I'm doing</p>
<p>182
00:22:56.150 --&gt; 00:23:04.390
Miki Tebeka: And again, 100,000 simulations this time. The group size is 23, and</p>
<p>183
00:23:04.550 --&gt; 00:23:10.139
Miki Tebeka: duplicates is 0. And again, I'm doing the same follow-up, the same thing as</p>
<p>184
00:23:10.290 --&gt; 00:23:11.440
Miki Tebeka: I'm doing here.</p>
<p>185
00:23:17.470 --&gt; 00:23:22.520
Miki Tebeka: And if I run this one, we get the same number.</p>
<p>186
00:23:23.640 --&gt; 00:23:33.690
Miki Tebeka: But this is rounding. I'm pretty sure if I show more digits after the decimal, you're going to see the difference. But the idea is that</p>
<p>187
00:23:34.725 --&gt; 00:23:38.680
Miki Tebeka: we added more. We made our model more accurate,</p>
<p>188
00:23:38.790 --&gt; 00:23:42.699
Miki Tebeka: but even the inaccurate model was good enough,</p>
<p>189
00:23:42.850 --&gt; 00:23:48.714
Miki Tebeka: right? It was good enough, and that's what is meant by saying that all models are wrong, but some are useful.</p>
<p>190
00:23:49.760 --&gt; 00:23:59.990
Miki Tebeka: You don't have to have the exact distribution or exact information about your data to gain some insights, which are correct, from the data. And a lot of the time</p>
<p>191
00:24:00.420 --&gt; 00:24:02.260
Miki Tebeka: you can do approximations.</p>
<p>192
00:24:02.910 --&gt; 00:24:09.580
Miki Tebeka: And statistically, it's still good. Questions?</p>
<p>193
00:24:16.730 --&gt; 00:24:17.690
Miki Tebeka: No questions.</p>
<p>194
00:24:18.270 --&gt; 00:24:25.577
Miki Tebeka: Okay, so this is about</p>
<p>195
00:24:26.270 --&gt; 00:24:32.039
Miki Tebeka: a question that they actually gave to doctors. They said that there is a test for a disease</p>
<p>196
00:24:32.370 --&gt; 00:24:42.200
Miki Tebeka: that has 5% false positives, meaning that in 5% of the people the test will tell you that you're sick, even though you're not sick.</p>
<p>197
00:24:42.530 --&gt; 00:24:44.750
Miki Tebeka: This is what is known as a false positive.</p>
<p>198
00:24:45.320 --&gt; 00:24:51.000
Miki Tebeka: and it is known that the disease strikes about one person in a thousand in the population.</p>
<p>199
00:24:52.330 --&gt; 00:24:55.103
Miki Tebeka: Okay. They said, okay, now we're taking</p>
<p>200
00:24:55.970 --&gt; 00:25:08.399
Miki Tebeka: a random test. We take a random person from the street, we run the test, and the test says this person is sick. What is the actual probability that this patient is really sick?</p>
<p>201
00:25:09.550 --&gt; 00:25:11.930
Miki Tebeka: Okay? And think about that.</p>
<p>202
00:25:12.388 --&gt; 00:25:18.140
Miki Tebeka: With Covid, for example, right? They swab you for Covid and it says you have Covid. Now</p>
<p>203
00:25:18.430 --&gt; 00:25:38.570
Miki Tebeka: you're home, you're not allowed to go out. Sometimes, you know, for other diseases, it might be that the doctor says: okay, you're sick, now you need a treatment, maybe a violent treatment, maybe something which costs a lot of money. So they asked these doctors, to see if they're actually basing their decisions on something which makes sense or not.</p>
<p>204
00:25:38.690 --&gt; 00:25:43.869
Miki Tebeka: Right. So talking about true positives, right? So</p>
<p>205
00:25:44.450 --&gt; 00:25:50.620
Miki Tebeka: if I predicted that the person is sick and they're actually sick, this is what is known as a true positive.</p>
<p>206
00:25:51.430 --&gt; 00:25:57.499
Miki Tebeka: We talked about a false positive, which is: a person is said to be sick, but they are actually healthy.</p>
<p>207
00:25:59.040 --&gt; 00:26:03.640
Miki Tebeka: And we also have a false negative, which is now: a</p>
<p>208
00:26:03.860 --&gt; 00:26:06.710
Miki Tebeka: person who is sick, and we said that they're healthy.</p>
<p>209
00:26:06.920 --&gt; 00:26:15.630
Miki Tebeka: and we have a true negative, which is a healthy person that the test says is healthy. Right? Remember: positive is sick, negative is healthy.</p>
<p>210
00:26:16.180 --&gt; 00:26:19.649
Miki Tebeka: That's it. This thing is known as a confusion matrix.</p>
<p>211
00:26:19.910 --&gt; 00:26:25.470
Miki Tebeka: And on the confusion matrix, you can do a lot of</p>
<p>212
00:26:25.910 --&gt; 00:26:29.789
Miki Tebeka: calculations. When you measure your models.</p>
<p>213
00:26:30.060 --&gt; 00:26:43.740
Miki Tebeka: especially prediction models, you start with the confusion matrix and then ask: what is the percentage of true positives? There's precision, recall, and several other things that come to mind.</p>
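<p>The metrics mentioned here are all simple ratios over the four cells of the confusion matrix; the counts below are made up for illustration:</p>

```python
# Made-up confusion-matrix counts for illustration.
tp, fp, fn, tn = 90, 10, 5, 895  # true/false positives and negatives

precision = tp / (tp + fp)              # of those predicted sick, how many are sick
recall = tp / (tp + fn)                 # of those actually sick, how many we caught
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(precision, recall, accuracy)
```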
<p>214
00:26:44.070 --&gt; 00:26:51.570
Miki Tebeka: I think the name confusion matrix is also very good, because</p>
<p>215
00:26:51.760 --&gt; 00:26:55.139
Miki Tebeka: I always get confused by that. I need to go back and think about</p>
<p>216
00:26:56.000 --&gt; 00:26:58.449
Miki Tebeka: what every term is saying. But we do that.</p>
<p>217
00:26:58.600 --&gt; 00:27:04.280
Miki Tebeka: So let's have a look at the simulation. Okay, so</p>
<p>218
00:27:07.510 --&gt; 00:27:10.059
Miki Tebeka: I have a function, one in N. So</p>
<p>219
00:27:10.560 --&gt; 00:27:14.720
Miki Tebeka: I want to say that one in</p>
<p>220
00:27:15.310 --&gt; 00:27:27.320
Miki Tebeka: a thousand. Right? They said one in a thousand is sick. So basically what they're saying is: I'm drawing a number between 1 and N, and checking if this number is 1. I can pick any number between 1 and N, so 17,</p>
<p>221
00:27:27.780 --&gt; 00:27:31.560
Miki Tebeka: 3, any number will work. I just need</p>
<p>222
00:27:31.730 --&gt; 00:27:34.390
Miki Tebeka: something that happens one in N times.</p>
<p>223
00:27:36.230 --&gt; 00:27:43.460
Miki Tebeka: And now I'm going to say, like this: there is a person,</p>
<p>224
00:27:44.190 --&gt; 00:27:56.340
Miki Tebeka: and if the person is sick, we are going to say that the person is sick. This is not specified in the question, but this is the assumption: the assumption that there are no false negatives.</p>
<p>225
00:27:56.680 --&gt; 00:27:58.850
Miki Tebeka: There are only false positives.</p>
<p>226
00:27:59.020 --&gt; 00:28:06.990
Miki Tebeka: So then, when we are doing the test,</p>
<p>227
00:28:07.220 --&gt; 00:28:27.909
Miki Tebeka: we say we have a 5% false positive rate. So in 1 in 20 cases this test is also going to say true. So if the person is sick, the test is going to say for sure you're sick. If you're healthy, there's a 5%, 1 in 20, chance that the test will still say that you're sick, even though you're healthy.</p>
<p>228
00:28:30.070 --&gt; 00:28:35.889
Miki Tebeka: So now, number of sick people and number of people who are diagnosed as sick are both starting at 0.</p>
<p>229
00:28:36.150 --&gt; 00:28:40.569
Miki Tebeka: And now we are running a million simulations</p>
<p>230
00:28:40.720 --&gt; 00:28:47.079
Miki Tebeka: for every one of them. We are picking a person at random, so the chances of the person being sick</p>
<p>231
00:28:47.240 --&gt; 00:28:57.279
Miki Tebeka: like we said here: the disease strikes 1 of every 1,000 people in the population.</p>
<p>232
00:28:57.390 --&gt; 00:29:01.800
Miki Tebeka: Right? So there's a 1 in a thousand chance that this person is sick.</p>
<p>233
00:29:02.450 --&gt; 00:29:06.310
Miki Tebeka: and if the person is sick, then we increment the number of sick.</p>
<p>234
00:29:06.440 --&gt; 00:29:10.619
Miki Tebeka: and then we do the diagnosis for the person, and if</p>
<p>235
00:29:11.990 --&gt; 00:29:14.820
Miki Tebeka: we diagnose the person as sick, we increment the number of diagnosed.</p>
<p>236
00:29:14.970 --&gt; 00:29:18.160
Miki Tebeka: But okay.</p>
<p>237
00:29:18.760 --&gt; 00:29:26.069
Miki Tebeka: So now we have the number of people who are actually sick, and the number of people who were diagnosed as sick. And we are going to</p>
<p>238
00:29:27.110 --&gt; 00:29:31.512
Miki Tebeka: print out this frequency that says,</p>
<p>239
00:29:33.000 --&gt; 00:29:38.800
Miki Tebeka: what is the percentage of people who are actually sick out of the people who are diagnosed as sick?</p>
<p>240
00:29:41.160 --&gt; 00:29:42.610
Miki Tebeka: Anyone care to guess</p>
<p>241
00:29:51.340 --&gt; 00:29:54.650
Miki Tebeka: 2%. See, you are good.</p>
<p>242
00:29:57.160 --&gt; 00:30:02.210
Miki Tebeka: Okay? So a lot of people are saying, you know, this is</p>
<p>243
00:30:02.340 --&gt; 00:30:04.820
Miki Tebeka: what, right? The test is</p>
<p>244
00:30:05.020 --&gt; 00:30:15.159
Miki Tebeka: only 5% false positive, how come? So 95% of the time it's okay. But still, only 2% of the diagnosed people are sick.</p>
<p>245
00:30:15.530 --&gt; 00:30:17.620
Miki Tebeka: So, and think about that.</p>
<p>246
00:30:19.520 --&gt; 00:30:24.550
Miki Tebeka: yeah, 0.1% divided by 5%, that's 2%.</p>
<p>247
00:30:25.490 --&gt; 00:30:26.240
Miki Tebeka: So</p>
<p>248
00:30:29.630 --&gt; 00:30:50.819
Miki Tebeka: if you come to think about that, that has a lot of implications. This is, again, this intuition that we have that is usually wrong when it comes to these things. And the sad thing is that they give this test to a lot of doctors, and most of them get it wrong. And then it means that they're actually basing treatments and other things on something which is</p>
<p>249
00:30:51.220 --&gt; 00:30:52.330
Miki Tebeka: not correct.</p>
<p>250
00:30:53.535 --&gt; 00:30:57.430
Miki Tebeka: So maybe they should run a simulation</p>
<p>251
00:30:57.600 --&gt; 00:31:00.239
Miki Tebeka: and get some understanding of what's going on.</p>
<p>252
00:31:01.920 --&gt; 00:31:02.710
Miki Tebeka: Alright.</p>
<p>253
00:31:03.700 --&gt; 00:31:11.480
Miki Tebeka: The last one is known as the Monty Hall problem. And we are</p>
<p>254
00:31:11.970 --&gt; 00:31:14.529
Miki Tebeka: okay. Well, we have a lot of time.</p>
<p>255
00:31:14.860 --&gt; 00:31:24.710
Miki Tebeka: I'll start speaking. So the Monty Hall problem says: you're in a game show, and you have 3 doors,</p>
<p>256
00:31:24.990 --&gt; 00:31:30.829
Miki Tebeka: and the host says, you know, behind 2 doors there are goats.</p>
<p>257
00:31:31.030 --&gt; 00:31:36.710
Miki Tebeka: and behind the 3rd door there is a car that you can win.</p>
<p>258
00:31:37.980 --&gt; 00:31:43.140
Miki Tebeka: And he says: pick a door, 1, 2, or 3. And you pick a door; let's say I picked one.</p>
<p>259
00:31:43.270 --&gt; 00:31:50.730
Miki Tebeka: And now the host goes to another door, let's say this time door number 2, opens the door, and shows you a goat.</p>
<p>260
00:31:50.990 --&gt; 00:31:53.920
Miki Tebeka: And now, he says, do you want to keep</p>
<p>261
00:31:54.530 --&gt; 00:31:58.800
Miki Tebeka: your original door, or do you want to switch to the remaining one?</p>
<p>262
00:32:02.050 --&gt; 00:32:14.090
Miki Tebeka: Right? So you have a strategy now: to say, I picked door number one, I'm going to stay with door number one; or, after they show me the door with the goat, I want to change my answer, and I actually go on to pick door number 3.</p>
<p>263
00:32:14.660 --&gt; 00:32:19.359
Miki Tebeka: So what is the strategy? What is the good strategy in this case?</p>
<p>264
00:32:19.660 --&gt; 00:32:24.539
Miki Tebeka: Too late, is it? On, on one or 2?</p>
<p>265
00:32:25.203 --&gt; 00:32:35.650
Miki Tebeka: So again, we are going to simulate, right? So a random door is a random number.</p>
<p>266
00:32:36.020 --&gt; 00:32:45.589
Miki Tebeka: And here what we're doing is we say: does staying with the door win the game? Right? So we</p>
<p>267
00:32:46.133 --&gt; 00:32:49.850
Miki Tebeka: pick one door, which is the door the car is</p>
<p>268
00:32:51.282 --&gt; 00:32:57.270
Miki Tebeka: behind, and then we pick another door, which is the door that the player picked.</p>
<p>269
00:32:57.890 --&gt; 00:33:03.619
Miki Tebeka: Now, if it is the same door, it means that the player who says "I'm going to stay"</p>
<p>270
00:33:03.720 --&gt; 00:33:06.080
Miki Tebeka: is going to win the game.</p>
<p>271
00:33:07.580 --&gt; 00:33:14.870
Miki Tebeka: Okay? So I'm just saying: if the car door is equal to the player door, then, meaning, the stay strategy wins.</p>
<p>272
00:33:15.240 --&gt; 00:33:17.310
Miki Tebeka: And now I'm going to</p>
<p>273
00:33:18.660 --&gt; 00:33:26.040
Miki Tebeka: to do a million simulations. I'm going to say that these are the number of wins that the stay strategy has,</p>
<p>274
00:33:26.160 --&gt; 00:33:31.619
Miki Tebeka: and these are the number of wins that the switch strategy had.</p>
<p>275
00:33:31.880 --&gt; 00:33:38.980
Miki Tebeka: And I'm going to run the simulation, and then, if stay wins the game, I'm incrementing the stays. Otherwise</p>
<p>276
00:33:39.620 --&gt; 00:33:43.610
Miki Tebeka: I'm going to increment the switches, and then I</p>
<p>277
00:33:44.710 --&gt; 00:33:53.400
Miki Tebeka: divide them by N. So, what is the fraction? I'm printing out what is the fraction of times we won by staying, and what is the fraction of</p>
<p>278
00:33:53.980 --&gt; 00:33:58.640
Miki Tebeka: times that we won by switching.</p>
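<p>The Monty Hall simulation described above can be sketched like this; the names are guesses, not the speaker's exact code. The key observation is that staying wins exactly when the first pick was already the car, so one function covers both strategies:</p>

```python
import random

random.seed(0)

def stay_wins():
    car_door = random.randint(1, 3)     # door hiding the car
    player_door = random.randint(1, 3)  # door the player picked
    # The host always opens a goat door, so staying wins exactly
    # when the first pick was already the car; otherwise switching wins.
    return car_door == player_door

n = 1_000_000
stays = sum(stay_wins() for _ in range(n))
switches = n - stays  # every game the stayer loses, the switcher wins

stay_frac, switch_frac = stays / n, switches / n
print(stay_frac, switch_frac)  # roughly 1/3 vs 2/3
```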
<p>279
00:34:04.620 --&gt; 00:34:10.360
Miki Tebeka: Any guesses which strategy is better?</p>
<p>280
00:34:20.120 --&gt; 00:34:25.560
Miki Tebeka: So if you switch doors, you have twice the chance of winning</p>
<p>281
00:34:25.980 --&gt; 00:34:30.609
Miki Tebeka: than if you stay.</p>
<p>282
00:34:30.900 --&gt; 00:34:35.659
Miki Tebeka: And this is really counterintuitive. Because why?</p>
<p>283
00:34:36.440 --&gt; 00:34:45.299
Miki Tebeka: Right? I picked the door at random. There could be a car behind it. The fact that someone showed me another door that I didn't pick has a goat behind it shouldn't change that.</p>
<p>284
00:34:45.790 --&gt; 00:34:47.720
Miki Tebeka: But it actually does.</p>
<p>285
00:34:47.920 --&gt; 00:34:56.590
Miki Tebeka: And there's a lot of debate. If you Google the Monty Hall problem, with statistics there's a lot of debate about</p>
<p>286
00:34:56.710 --&gt; 00:34:59.999
Miki Tebeka: what it means, and</p>
<p>287
00:35:00.260 --&gt; 00:35:10.629
Miki Tebeka: whether these calculations are okay or not. For me, I have a strategy now. So if I see a goat, I pick the other door. And that's it.</p>
<p>288
00:35:14.370 --&gt; 00:35:18.649
Miki Tebeka: Okay? So you can read more on Wikipedia, on the Monty Hall problem.</p>
<p>289
00:35:19.582 --&gt; 00:35:25.739
Miki Tebeka: So these are basically the 4 cases. And I hope I convinced you that</p>
<p>290
00:35:26.660 --&gt; 00:35:30.299
Miki Tebeka: when you have these questions, don't shy away because you don't know the math,</p>
<p>291
00:35:30.470 --&gt; 00:35:40.719
Miki Tebeka: because you don't know how to figure out the statistics. A simulation will also help you, because a lot of the time our intuition, when it comes to statistics and probabilities, is usually wrong.</p>
<p>292
00:35:41.280 --&gt; 00:35:50.400
Miki Tebeka: We, as people, have good intuition about small numbers. When it comes to large numbers, we are really</p>
<p>293
00:35:51.569 --&gt; 00:35:53.129
Miki Tebeka: very bad at that.</p>
<p>294
00:35:53.827 --&gt; 00:36:04.760
Miki Tebeka: There's one statistician who says that every time he works with statistics he needs to turn off the part of the brain that says "I know what I'm doing," and just trust the numbers.</p>
<p>295
00:36:05.050 --&gt; 00:36:17.420
Miki Tebeka: Right? So if you want to learn more, there is a great talk by Jake VanderPlas, an astrophysicist who's heavily involved in the scientific Python community,</p>
<p>296
00:36:17.850 --&gt; 00:36:35.519
Miki Tebeka: and that's where I started; he shows some other simulations and how to do statistics. You can read more about Monte Carlo simulation in Wikipedia. By the way, I think even in Google Sheets and Excel you can run</p>
<p>297
00:36:36.570 --&gt; 00:36:42.519
Miki Tebeka: Monte Carlo simulations, which is pretty awesome.</p>
<p>298
00:36:46.570 --&gt; 00:36:50.369
Miki Tebeka: And in Excel, now we have Python in Excel, right? So this is</p>
<p>299
00:36:50.760 --&gt; 00:36:54.949
Miki Tebeka: great. And there's a library called SimPy,</p>
<p>300
00:36:55.100 --&gt; 00:37:10.809
Miki Tebeka: if you want to do what is known as a discrete-event simulation. With SimPy, basically, for every tick you tell your process to do something, and then you can simulate cars going, people crossing the road, cell phone towers, and a lot of other things.</p>
<p>301
00:37:14.760 --&gt; 00:37:23.070
Miki Tebeka: Zia, I'm not sure what your name is, but he said the simulation will not help you if the problem is not well specified, and that is true.</p>
<p>302
00:37:23.360 --&gt; 00:37:30.109
Miki Tebeka: Right. So you need a good definition of the problem before you start. If you have a vague definition of the problem, then</p>
<p>303
00:37:30.650 --&gt; 00:37:31.560
Miki Tebeka: now</p>
<p>304
00:37:40.240 --&gt; 00:37:48.350
Miki Tebeka: you can think about it any way you want. I'm not sure why you're saying that the Monty Hall problem is not fully defined, but we can talk about it later.</p>
<p>305
00:37:50.230 --&gt; 00:37:52.989
Miki Tebeka: Because this is just about picking a strategy to win.</p>
<p>306
00:37:53.100 --&gt; 00:37:55.940
Miki Tebeka: And I think the switch one is a winning strategy.</p>
<p>307
00:37:58.290 --&gt; 00:38:03.219
Miki Tebeka: That's it. All of this code is in my GitHub</p>
<p>308
00:38:03.490 --&gt; 00:38:11.880
Miki Tebeka: talks repo, so you can look at the code there, all the things that I have there.</p>
<p>309
00:38:13.050 --&gt; 00:38:20.920
Miki Tebeka: I wrote a book on Python with quizzes, if you want to buy it. And if you have questions, this is a good time to ask them.</p>
<p>310
00:38:28.840 --&gt; 00:38:30.660
Miki Tebeka: No questions, Gabor?</p>
<p>311
00:38:30.970 --&gt; 00:38:36.500
Gabor Szabo: Well, I don't have any question either. Now. I already asked the ones I had.</p>
<p>312
00:38:36.620 --&gt; 00:38:40.469
Gabor Szabo: I just want to thank you, and thank everyone who joined us.</p>
<p>313
00:38:41.320 --&gt; 00:38:42.590
Gabor Szabo: And</p>
<p>314
00:38:44.080 --&gt; 00:38:56.819
Gabor Szabo: if you are watching the video and you reach this point, then please remember to like the video and follow the channel. And under the video you will find the link to this GitHub page,</p>
<p>315
00:38:56.930 --&gt; 00:39:05.280
Gabor Szabo: to this GitHub repo, and you will also be able to find Miki. I guess you also,</p>
<p>316
00:39:05.730 --&gt; 00:39:06.530
Miki Tebeka: Yeah, yeah, sure.</p>
<p>317
00:39:06.530 --&gt; 00:39:07.200
Gabor Szabo: Share your link.</p>
<p>318
00:39:07.200 --&gt; 00:39:09.780
Miki Tebeka: If you have any questions I will answer.</p>
<p>319
00:39:11.540 --&gt; 00:39:15.230
Gabor Szabo: Okay, so thank you very much.</p>
<p>320
00:39:15.510 --&gt; 00:39:16.870
Miki Tebeka: Thanks for organizing this.</p>
<p>321
00:39:17.710 --&gt; 00:39:21.050
Gabor Szabo: You're welcome, and I hope to see you in other presentations.</p>
<p>322
00:39:21.050 --&gt; 00:39:22.110
Miki Tebeka: Awesome. Thank you.</p>
<p>323
00:39:22.110 --&gt; 00:39:22.770
Gabor Szabo: Bye, bye.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Python async.io - From zero to hero with Eyal Balla</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-02-13T13:10:01Z</updated>
    <pubDate>2025-02-13T13:10:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/async-io-from-zero-to-hero" />
    <id>https://python.code-maven.com/async-io-from-zero-to-hero</id>
<content type="html"><![CDATA[<p>Asyncio is a cool part of Python. But what does it do?
It is a way to write async code in Python.</p>
<p>This lecture shows real-world use cases, know-how, and troubleshooting methods for using asyncio in Python.</p>
<p><a href="https://www.linkedin.com/in/eyal-balla/">Eyal Balla</a></p>
<p><img src="images/eyal-balla.jpeg" alt="Eyal Balla" /></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/aNYSG_HcX0g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:01.740 --&gt; 00:00:30.889
Gabor Szabo: So hello, and welcome to the Code Mavens meetup group and Code Mavens channel, if you're watching it on Youtube. My name is Gabor Szabo. I usually teach Python and Rust at companies, and also introduce test automation and CI, and that area. And I also think that sharing knowledge is extremely important among high tech</p>
<p>2
00:00:31.470 --&gt; 00:00:56.010
Gabor Szabo: programmers and people working in the high tech industry. So that's why I am organizing these meetings, these presentations. As you can see, it's being recorded; it's going to be on Youtube. Please like the video and follow the channel. Below the video you will find a link to information about Eyal and about the content of this video.</p>
<p>3
00:00:56.130 --&gt; 00:01:03.549
Gabor Szabo: And I would like to welcome everyone who joined us in the live meeting, and especially Eyal, who is giving us the presentation.</p>
<p>4
00:01:03.820 --&gt; 00:01:11.349
Gabor Szabo: So now it's yours. Please introduce yourself and share the screen as you feel fit and welcome.</p>
<p>5
00:01:12.590 --&gt; 00:01:30.479
Eyal Balla: Thank you. So I'll share the screen. So this presentation is like a take-off on a presentation I did, I think, 2 years ago, or maybe 3 years ago.</p>
<p>6
00:01:33.070 --&gt; 00:01:42.467
Eyal Balla: A bit about me: so I've been developing for more than 15 years, and working in Python like</p>
<p>7
00:01:43.120 --&gt; 00:01:49.530
Eyal Balla: 5 to 10 years. And currently I lead the data team at Scenario.</p>
<p>8
00:01:50.390 --&gt; 00:01:51.629
Eyal Balla: See? That's me.</p>
<p>9
00:01:52.994 --&gt; 00:02:04.040
Eyal Balla: So you can find me in the links, and if you're interested we're also hiring. So you're welcome to try and join. And we're looking for people that</p>
<p>10
00:02:04.520 --&gt; 00:02:08.080
Eyal Balla: work in Python, and that is their passion.</p>
<p>11
00:02:08.650 --&gt; 00:02:13.000
Eyal Balla: So today, what I'm gonna do is I'm gonna go through</p>
<p>12
00:02:13.340 --&gt; 00:02:29.884
Eyal Balla: a bit about what asyncio is, and try to give a real-world example from things that we do, and then we'll try and talk about some advanced topics regarding asyncio. So that's what we're gonna do.</p>
<p>13
00:02:31.060 --&gt; 00:02:36.420
Eyal Balla: I think it's important that you guys feel free to step in and ask questions if you need</p>
<p>14
00:02:38.620 --&gt; 00:02:45.520
Eyal Balla: because there's gonna be a bit of code and some topics, and maybe</p>
<p>15
00:02:45.730 --&gt; 00:02:52.010
Eyal Balla: hopefully, it's gonna be all clear. But if somebody has any question, then feel free to jump in.</p>
<p>16
00:02:52.550 --&gt; 00:03:00.030
Eyal Balla: So first of all, what is asyncio? Asyncio is a style of concurrent programming in Python.</p>
<p>17
00:03:00.340 --&gt; 00:03:04.660
Eyal Balla: So why do we need it? So you can think of</p>
<p>18
00:03:04.960 --&gt; 00:03:08.569
Eyal Balla: wanting to do multiple things in Python at the same time.</p>
<p>19
00:03:08.770 --&gt; 00:03:09.506
Eyal Balla: So</p>
<p>20
00:03:11.140 --&gt; 00:03:19.540
Eyal Balla: a simple way to do it is using a fork, right? So you can run multiple Python processes at the same time.</p>
<p>21
00:03:20.010 --&gt; 00:03:43.789
Eyal Balla: So the OS handles the concurrency, and you can actually use multiple cores on your machines. The problem is that you get duplicated memory, because each process has its own memory space, right? And in order to communicate between the different Python processes you need OS-level communication, so pipes and</p>
<p>22
00:03:43.950 --&gt; 00:03:45.463
Eyal Balla: files, and</p>
<p>23
00:03:47.161 --&gt; 00:03:59.079
Eyal Balla: other ways of multi-process communication. So you'd say, okay, maybe we can do it some other way. So there's also multi-threading in Python.</p>
<p>24
00:03:59.390 --&gt; 00:04:02.710
Eyal Balla: So this is nice: you can create a new thread, and</p>
<p>25
00:04:03.242 --&gt; 00:04:11.339
Eyal Balla: it looks like you can run multiple things at the same time. But then there's the GIL. The GIL is the global interpreter lock.</p>
<p>26
00:04:11.878 --&gt; 00:04:17.962
Eyal Balla: I know there's an effort in Python to try and remove it, but for now it's there.</p>
<p>27
00:04:19.910 --&gt; 00:04:25.750
Eyal Balla: And the GIL prevents multiple Python instructions from running in the same process at the same time.</p>
<p>28
00:04:25.860 --&gt; 00:04:34.925
Eyal Balla: So with threading, concurrency happens when you do things like</p>
<p>29
00:04:35.580 --&gt; 00:04:41.672
Eyal Balla: accessing files or the network, or any time control passes from your</p>
<p>30
00:04:42.660 --&gt; 00:04:46.390
Eyal Balla: Python commands into the OS; that's when it happens.</p>
<p>31
00:04:47.221 --&gt; 00:04:59.660
Eyal Balla: It's something you can't do explicitly; you can only do it implicitly, by accessing something or doing something that requires OS interaction.</p>
<p>32
00:05:00.720 --&gt; 00:05:12.629
Eyal Balla: And there's asyncio. So what is asyncio? It's an I/O event manager, okay? And it helps you manage states, so you can have multiple states of your system</p>
<p>33
00:05:12.750 --&gt; 00:05:24.389
Eyal Balla: on the same thread. And you can actually explicitly manage the context switching. So you can say, I want to work on multiple items. These are the multiple items. And I want to work on them.</p>
<p>34
00:05:24.920 --&gt; 00:05:33.670
Eyal Balla: So if we look at a high level at what the options are. For multiprocessing, say we have multiple processes:</p>
<p>35
00:05:33.780 --&gt; 00:05:47.120
Eyal Balla: so you can have concurrency, and you can use all the cpus. But you know you can't run many processes on the same machine, because each uses a full CPU, so maybe</p>
<p>36
00:05:47.740 --&gt; 00:05:52.530
Eyal Balla: one to 10 CPUs and processes.</p>
<p>37
00:05:52.800 --&gt; 00:06:01.410
Eyal Balla: And you can generally use the standard library blocking components and synchronization tools.</p>
<p>38
00:06:02.110 --&gt; 00:06:16.011
Eyal Balla: And then, if you need something that's maybe a bit higher on the scalability, you can use threads. You have a single process, and the GIL is protecting you from</p>
<p>39
00:06:16.910 --&gt; 00:06:22.843
Eyal Balla: doing things between threads which touch memory</p>
<p>40
00:06:23.670 --&gt; 00:06:34.039
Eyal Balla: in an intrusive way. But the problem is that you let the OS schedule your code, and you can't really</p>
<p>41
00:06:34.792 --&gt; 00:06:36.720
Eyal Balla: control it manually.</p>
<p>42
00:06:37.070 --&gt; 00:06:51.839
Eyal Balla: And then there's asyncio, where you can actually handle many thousands of scalable small components, called coroutines, and it's at the application level.</p>
<p>43
00:06:51.940 --&gt; 00:06:58.970
Eyal Balla: So this is what we're gonna look at today. And we're gonna see how it works and how you can control it.</p>
<p>44
00:07:00.620 --&gt; 00:07:03.188
Eyal Balla: So like every other</p>
<p>45
00:07:04.440 --&gt; 00:07:09.849
Eyal Balla: program in the world, there's a hello world for asyncio, right? So there's</p>
<p>46
00:07:12.010 --&gt; 00:07:15.570
Eyal Balla: Let's see, can you see my cursor? Then you can.</p>
<p>47
00:07:16.310 --&gt; 00:07:23.630
Eyal Balla: So there's a regular hello world, and there's the one with asyncio and await. Okay, but</p>
<p>48
00:07:23.900 --&gt; 00:07:29.589
Eyal Balla: this program is not really helpful, right? Because it doesn't show anything that's important for us.</p>
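<p>A hello world along the lines the speaker shows might look like this; a sketch, the exact code on the slide may differ:</p>

```python
import asyncio

async def main():
    # A coroutine: await hands control back to the event loop.
    await asyncio.sleep(0.01)
    return "hello world"

# asyncio.run creates an event loop and runs the coroutine in it.
result = asyncio.run(main())
print(result)
```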
<p>49
00:07:30.210 --&gt; 00:07:50.489
Eyal Balla: So what do we want to do with asyncio in the real world? We want to use it for handling multiple heavy I/O processes: so, like, maybe database accesses, or multiple web requests, or file sharing, or accessing many I/O components at the same time.</p>
<p>50
00:07:51.330 --&gt; 00:08:10.769
Eyal Balla: And you can always use asyncio with multiple processes. So, maybe in a cloud application, you can have multiple pods, right? But also you can run it on multiple processes, and have the ability to use multiple CPUs if needed.</p>
<p>51
00:08:13.430 --&gt; 00:08:14.330
Eyal Balla: But</p>
<p>52
00:08:14.440 --&gt; 00:08:28.290
Eyal Balla: the downside of asyncio is that it's almost a different programming language. It looks like Python, and the constructs are very much like Python, and you just use a few more keywords. But</p>
<p>53
00:08:28.810 --&gt; 00:08:36.129
Eyal Balla: it's very different in concept, because each coroutine, the functions that you call with await,</p>
<p>54
00:08:37.038 --&gt; 00:08:45.999
Eyal Balla: has to be short enough to allow multiple contexts to run together. So you mustn't run</p>
<p>55
00:08:46.528 --&gt; 00:08:56.480
Eyal Balla: long computations; you can't block the event queue. Okay? Just like in a UI application, you don't want the main loop to be blocked.</p>
<p>56
00:08:56.890 --&gt; 00:09:10.859
Eyal Balla: And also you can't use general-purpose OS blocking commands, like creating connections for sockets, or select, or sleep. So you have things that are asyncio-specific.</p>
<p>57
00:09:13.062 --&gt; 00:09:20.630
Eyal Balla: So you even have different libraries that you can use in asyncio. So,</p>
<p>58
00:09:21.888 --&gt; 00:09:33.809
Eyal Balla: if you usually use requests, I suggest you try httpx, it has better behavior; FastAPI over Django and Flask;</p>
<p>59
00:09:34.671 --&gt; 00:09:37.929
Eyal Balla: asyncpg instead of psycopg, etcetera.</p>
<p>60
00:09:38.130 --&gt; 00:09:40.760
Eyal Balla: Okay, so</p>
<p>61
00:09:42.130 --&gt; 00:09:55.740
Eyal Balla: if we look at the main building blocks of asyncio, what we have is the main asyncio.run command. So what it does is, it receives, in a</p>
<p>62
00:09:57.490 --&gt; 00:10:05.362
Eyal Balla: sync context, a coroutine, something that</p>
<p>63
00:10:05.900 --&gt; 00:10:09.830
Eyal Balla: is run on the asyncio loop; it creates a loop</p>
<p>64
00:10:09.990 --&gt; 00:10:16.710
Eyal Balla: and runs the coroutine in it. And usually this is how you do the entry point into the asyncio context.</p>
<p>65
00:10:17.300 --&gt; 00:10:20.779
Eyal Balla: Okay? And then you have the coroutines. These look like:</p>
<p>66
00:10:21.330 --&gt; 00:10:26.570
Eyal Balla: you define async def and a function, and this creates a coroutine.</p>
<p>67
00:10:26.770 --&gt; 00:10:31.479
Eyal Balla: Okay, and coroutines you can run either using the main loop</p>
<p>68
00:10:31.600 --&gt; 00:10:38.720
Eyal Balla: or create a context, a task, using asyncio.create_task.</p>
<p>69
00:10:39.560 --&gt; 00:10:40.470
Eyal Balla: Okay?</p>
<p>70
00:10:40.740 --&gt; 00:10:49.319
Eyal Balla: And also, when you want to wait for something to happen and you want to release the context, you run await.</p>
<p>71
00:10:49.480 --&gt; 00:10:59.880
Eyal Balla: You call await in the coroutine, and then the context itself waits until the async context is finished</p>
<p>72
00:11:00.260 --&gt; 00:11:04.519
Eyal Balla: and then returns the control to the main loop that calls this.</p>
<p>73
00:11:04.620 --&gt; 00:11:05.430
Eyal Balla: Okay?</p>
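<p>A small sketch of the two ways of running a coroutine mentioned above: awaiting it directly, or scheduling it as a task with create_task (the names here are made up for illustration):</p>

```python
import asyncio

async def fetch(name, delay):
    # await hands control back to the event loop until the sleep finishes
    await asyncio.sleep(delay)
    return name

async def main():
    # create_task schedules the coroutine to run concurrently on the loop
    task = asyncio.create_task(fetch("scheduled", 0.01))
    direct = await fetch("awaited", 0.01)  # run a coroutine and wait for it
    scheduled = await task                 # wait for the scheduled task
    return direct, scheduled

pair = asyncio.run(main())
```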
<p>74
00:11:06.350 --&gt; 00:11:07.370
Eyal Balla: So</p>
<p>75
00:11:08.022 --&gt; 00:11:21.649
Eyal Balla: I'm gonna show you guys an example of a small program. This is a synchronous program. It reads from S3 and then queries some database. Right?</p>
<p>76
00:11:22.040 --&gt; 00:11:23.960
Eyal Balla: So we have, like a</p>
<p>77
00:11:24.090 --&gt; 00:11:30.433
Eyal Balla: two contexts. This is the first one, this is the second one, and</p>
<p>78
00:11:31.700 --&gt; 00:11:48.750
Eyal Balla: they're not dependent on each other. You can see it just gets a file and then just runs a query. And we want to try and do these together, because we want the context returned with the content itself, but we don't have any connection between the two contexts.</p>
<p>79
00:11:49.460 --&gt; 00:11:56.740
Eyal Balla: So what you can do is you can move to asyncio, define this as a coroutine using the</p>
<p>80
00:11:57.150 --&gt; 00:11:58.716
Eyal Balla: aiobotocore</p>
<p>81
00:12:00.194 --&gt; 00:12:10.339
Eyal Balla: async library, and using asyncpg you create a coroutine from the query of the database.</p>
<p>82
00:12:10.660 --&gt; 00:12:21.090
Eyal Balla: And then you can use gather to run the two coroutines together. Okay, independently of each other. So when</p>
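<p>A hedged sketch of the same idea: asyncio.sleep stands in for the aiobotocore S3 read and the asyncpg query, since the point is only that gather overlaps the two waits:</p>

```python
import asyncio
import time

async def get_file():
    await asyncio.sleep(0.1)   # simulated S3 I/O
    return b"file body"

async def run_query():
    await asyncio.sleep(0.1)   # simulated database I/O
    return [{"id": 1}]

async def main():
    start = time.monotonic()
    # gather runs both coroutines concurrently and returns both results
    body, rows = await asyncio.gather(get_file(), run_query())
    return body, rows, time.monotonic() - start

body, rows, elapsed = asyncio.run(main())
# the two waits overlap, so the total is close to one sleep, not two
```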
<p>83
00:12:21.826 --&gt; 00:12:30.723
Eyal Balla: the execution of the query returns the items, and the body of the file is done, this is done</p>
<p>84
00:12:31.300 --&gt; 00:12:40.689
Eyal Balla: asynchronously, and while waiting for the I/O, the context continues to the other part.</p>
<p>85
00:12:41.160 --&gt; 00:12:42.000
Eyal Balla: Okay.</p>
<p>86
00:12:43.430 --&gt; 00:12:46.690
Eyal Balla: Questions? Great.</p>
<p>87
00:12:47.450 --&gt; 00:13:16.690
Eyal Balla: So, more options. Asyncio also supports context managers. This is very convenient. For instance, if you look at the bottom part here, I had to connect and then close using finally. But I can also create a context manager from this, open the connection, and when the context manager exits, close the connection. So I don't have to use</p>
<p>88
00:13:16.710 --&gt; 00:13:19.260
Eyal Balla: explicit exception handling.</p>
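<p>A sketch of that pattern with an async context manager; the Connection class here is a stand-in, not a real driver:</p>

```python
import asyncio
from contextlib import asynccontextmanager

events = []

class Connection:
    async def query(self, sql):
        await asyncio.sleep(0)      # simulated I/O
        return [sql]
    async def close(self):
        events.append("closed")

@asynccontextmanager
async def connect():
    conn = Connection()
    events.append("opened")
    try:
        yield conn
    finally:
        await conn.close()          # runs even if the body raises

async def main():
    async with connect() as conn:   # no explicit try/finally at the call site
        return await conn.query("SELECT 1")

rows = asyncio.run(main())
```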
<p>89
00:13:19.850 --&gt; 00:13:21.320
Eyal Balla: And also</p>
<p>90
00:13:22.012 --&gt; 00:13:33.339
Eyal Balla: asyncio supports iterators, so you can use a generator way of controlling small parts of the code</p>
<p>91
00:13:33.848 --&gt; 00:13:43.049
Eyal Balla: one after the other, using async iterables. So these are patterns well known in Python, and you can also use them in an async context.</p>
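<p>For example, an async generator consumed with async for (a minimal sketch):</p>

```python
import asyncio

async def pages(count):
    # an async generator: each page is yielded after awaiting some I/O
    for n in range(count):
        await asyncio.sleep(0)      # simulated I/O per page
        yield [n * 2, n * 2 + 1]

async def main():
    items = []
    async for page in pages(3):     # async for drives the generator
        items.extend(page)
    return items

items = asyncio.run(main())
```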
<p>92
00:13:45.200 --&gt; 00:13:50.409
Eyal Balla: So now I want to show you guys, maybe a problem from our day to day work.</p>
<p>93
00:13:50.870 --&gt; 00:13:56.281
Eyal Balla: I'll present the problem first. So what we want to do is we want to do some</p>
<p>94
00:13:58.582 --&gt; 00:14:21.089
Eyal Balla: integration which reads data from an external source and then enriches it. Okay, gets information from maybe a database, adds to the context from the external source, and then writes the results into our database as entities. Okay? And I think the main issue here is that maybe</p>
<p>95
00:14:21.659 --&gt; 00:14:28.550
Eyal Balla: we have multiple customers. Some are small, some are large, and customers have maybe</p>
<p>96
00:14:29.200 --&gt; 00:14:44.130
Eyal Balla: tens of thousands of entities. So there's a lot of reading from the external source, and also maybe a lot of writing and reading into the database. So we have a lot of I/O. And this actually fits the asyncio concepts very well.</p>
<p>97
00:14:44.460 --&gt; 00:14:48.700
Eyal Balla: Okay, so what do we want to do?</p>
<p>98
00:14:49.388 --&gt; 00:15:01.639
Eyal Balla: We wanna call something once in a while and go over each of the customers, get the information and then update with the enriched information into our database. Okay? So</p>
<p>99
00:15:01.810 --&gt; 00:15:06.059
Eyal Balla: like, a naive implementation would be something like this.</p>
<p>100
00:15:07.620 --&gt; 00:15:13.880
Eyal Balla: You define all the bootstraps</p>
<p>101
00:15:14.549 --&gt; 00:15:37.869
Eyal Balla: needed, and then you get the list of customers. And then for each customer, you do the enrichment. So you get the settings and you run the enricher on the information. Per customer, you get the information from the integration system, enrich it, and write it into your database. Right?</p>
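<p>Roughly, the naive synchronous version looks like this; every helper here (get_customers, get_integration_items, enrich, save) is a hypothetical stand-in for the real system:</p>

```python
def get_customers():
    return ["small-co", "big-co"]

def get_integration_items(customer):
    return [f"{customer}-item-{n}" for n in range(2)]

def enrich(item):
    return item.upper()

saved = []

def save(customer, items):
    saved.append((customer, items))

def run_all():
    # one customer at a time: the next customer waits for the previous one
    for customer in get_customers():
        items = get_integration_items(customer)
        save(customer, [enrich(item) for item in items])

run_all()
```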
<p>102
00:15:38.640 --&gt; 00:15:40.020
Eyal Balla: So this is nice.</p>
<p>103
00:15:40.935 --&gt; 00:15:41.740
Eyal Balla: But</p>
<p>104
00:15:42.558 --&gt; 00:15:54.589
Eyal Balla: the problem with that, when we look at this: this runs per customer. So that means that until one customer is done, the next customer doesn't start.</p>
<p>105
00:15:54.770 --&gt; 00:15:59.140
Eyal Balla: Okay? So if we have small customers and large customers, then</p>
<p>106
00:16:00.190 --&gt; 00:16:06.090
Eyal Balla: we have a problem that small customers are impacted by the size of large customers right?</p>
<p>107
00:16:06.910 --&gt; 00:16:17.090
Eyal Balla: And also, once we have something that's bigger than the total cron time, then it doesn't actually</p>
<p>108
00:16:17.991 --&gt; 00:16:30.509
Eyal Balla: keep up with the time it's called. So the system does not meet the time constraints it's supposed to run under.</p>
<p>109
00:16:32.395 --&gt; 00:16:36.110
Eyal Balla: So what can we do? I think the first thing we can do</p>
<p>110
00:16:36.310 --&gt; 00:16:48.919
Eyal Balla: is separate it per customer. So we can have some kind of injection of the customer id through a queue and have the system run only per customer. So</p>
<p>111
00:16:49.160 --&gt; 00:16:54.019
Eyal Balla: it reads the information from the queue, gets the customer id</p>
<p>112
00:16:54.380 --&gt; 00:17:01.040
Eyal Balla: here, and then runs the same thing just for a specific customer.</p>
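<p>A stdlib queue can stand in for the real message queue to sketch the idea; each worker instance pulls one customer id and processes only that customer:</p>

```python
import queue

work_queue = queue.Queue()
for customer_id in ["small-co", "big-co"]:
    work_queue.put(customer_id)         # inject one id per customer

processed = []

def worker():
    # each loop iteration handles exactly one customer; in production,
    # several worker instances would consume the same queue in parallel
    while not work_queue.empty():
        customer_id = work_queue.get()
        processed.append(customer_id)   # run the enrichment for this customer
        work_queue.task_done()

worker()
```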
<p>113
00:17:01.160 --&gt; 00:17:03.090
Eyal Balla: So how does this help? So</p>
<p>114
00:17:03.310 --&gt; 00:17:10.749
Eyal Balla: what we can do now is we can scale out. So we can have multiple instances of this specific code run</p>
<p>115
00:17:12.778 --&gt; 00:17:19.050
Eyal Balla: together, each on a different customer, and assuming that they're not dependent, then</p>
<p>116
00:17:19.699 --&gt; 00:17:27.270
Eyal Balla: small customers are now not impacted by the size of large customers, and the time that you want</p>
<p>117
00:17:27.829 --&gt; 00:17:29.460
Eyal Balla: to run this is</p>
<p>118
00:17:30.045 --&gt; 00:17:40.390
Eyal Balla: at most, the time of the biggest customer. Right? So you can scale out as much as you want, and the time that this whole process takes is the time of the biggest customer.</p>
<p>119
00:17:41.270 --&gt; 00:17:42.110
Eyal Balla: Okay?</p>
<p>120
00:17:42.890 --&gt; 00:17:49.890
Eyal Balla: So till now we did not touch anything that's asyncio, right? We just used simple</p>
<p>121
00:17:50.600 --&gt; 00:17:56.219
Eyal Balla: design patterns that allow scaling out of loops.</p>
<p>122
00:17:57.300 --&gt; 00:18:02.860
Eyal Balla: So now let's try and use asyncio to improve the performance of this whole loop.</p>
<p>123
00:18:03.030 --&gt; 00:18:05.249
Eyal Balla: So what do we do?</p>
<p>124
00:18:05.887 --&gt; 00:18:09.710
Eyal Balla: We create a coroutine and run it using asyncio.run.</p>
<p>125
00:18:11.583 --&gt; 00:18:17.220
Eyal Balla: This coroutine is very much similar to what we ran before.</p>
<p>126
00:18:19.340 --&gt; 00:18:25.209
Eyal Balla: Except that now when we look at what happens inside the run for customer.</p>
<p>127
00:18:25.450 --&gt; 00:18:27.550
Eyal Balla: this looks a bit different.</p>
<p>128
00:18:27.810 --&gt; 00:18:31.698
Eyal Balla: So what do we do? First,</p>
<p>129
00:18:33.466 --&gt; 00:18:44.300
Eyal Balla: we run through the pages. Okay? And when we want to enrich the items, we create batches of coroutines, and then we run them together.</p>
<p>130
00:18:44.420 --&gt; 00:18:59.979
Eyal Balla: Okay, so here the coroutines are created according to the number of the integration items, and when you enrich and read the information, each batch of the</p>
<p>131
00:19:00.460 --&gt; 00:19:04.559
Eyal Balla: coroutines runs together so you can wait for</p>
<p>132
00:19:06.570 --&gt; 00:19:11.189
Eyal Balla: So they happen together, and wait only for the I/O for each of the items.</p>
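<p>A sketch of that per-customer coroutine: iterate the pages with async for, then gather one enrich coroutine per item so each batch's I/O overlaps (the page source and the enrich step are simulated):</p>

```python
import asyncio

async def get_pages():
    # stands in for paged reads from the external source
    for page in ([1, 2, 3], [4, 5]):
        await asyncio.sleep(0)
        yield page

async def enrich(item):
    await asyncio.sleep(0.01)   # simulated I/O per item
    return item * 10

async def run_for_customer():
    results = []
    async for page in get_pages():
        # one coroutine per item; gather runs the whole batch concurrently
        batch = await asyncio.gather(*(enrich(item) for item in page))
        results.extend(batch)
    return results

results = asyncio.run(run_for_customer())
```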
<p>133
00:19:11.760 --&gt; 00:19:12.580
Eyal Balla: Okay,</p>
<p>134
00:19:16.630 --&gt; 00:19:20.870
Naty Harary: Yeah, I have a question. Should I just interrupt you mid-sentence?</p>
<p>135
00:19:20.870 --&gt; 00:19:21.420
Eyal Balla: Yeah.</p>
<p>136
00:19:22.394 --&gt; 00:19:29.490
Naty Harary: So, as far as I know, in asyncio it is enough to mark the function itself async,</p>
<p>137
00:19:29.700 --&gt; 00:19:34.740
Naty Harary: and you just await it. So I'm not really familiar with the syntax here.</p>
<p>138
00:19:35.258 --&gt; 00:19:40.680
Naty Harary: So why do we need to async this as well? I'm not really sure I understand.</p>
<p>139
00:19:42.430 --&gt; 00:19:48.560
Eyal Balla: What we're doing here is we await each of the pages. So this is an async iterator, right?</p>
<p>140
00:19:48.940 --&gt; 00:19:53.389
Eyal Balla: But this is the coroutine which returns an async iterator.</p>
<p>141
00:19:53.510 --&gt; 00:19:56.173
Eyal Balla: And when when you</p>
<p>142
00:19:57.270 --&gt; 00:20:06.239
Eyal Balla: each of these pages contains items, so you want to enrich each of the items. So you create coroutines for each of the items to be enriched.</p>
<p>143
00:20:06.390 --&gt; 00:20:16.100
Eyal Balla: And you run them using gather, because when you run await on something, this makes the</p>
<p>144
00:20:16.570 --&gt; 00:20:24.700
Eyal Balla: higher level function wait till it's done. Okay, this is a way to synchronize async contexts.</p>
<p>145
00:20:25.220 --&gt; 00:20:29.610
Eyal Balla: Okay, so here you synchronize multiple async contexts using gather.</p>
<p>146
00:20:31.990 --&gt; 00:20:41.489
Naty Harary: I see. So you just gather all the chunks that you have, and you create them with the async iterator, rather than just taking one big function and running async on that, right?</p>
<p>147
00:20:41.490 --&gt; 00:21:00.950
Eyal Balla: Because you want to split your context into smaller processing units. Each of them may be I/O bound, so the I/O would run together, in parallel, on each of the items.</p>
<p>148
00:21:01.940 --&gt; 00:21:03.429
Naty Harary: Got it. Thank you.</p>
<p>149
00:21:06.130 --&gt; 00:21:07.540
Eyal Balla: Okay. So</p>
<p>150
00:21:08.060 --&gt; 00:21:27.009
Eyal Balla: now, like I said before, enrichment happens in parallel. But still you can scale out, so you can have multiple services. The total performance here is not blocked, and also small customers are not impacted by the large customers.</p>
<p>151
00:21:30.120 --&gt; 00:21:34.040
Eyal Balla: Okay. So some other things you should take into consideration.</p>
<p>152
00:21:34.300 --&gt; 00:21:45.839
Eyal Balla: So I think the first thing is exception handling. When you create an async context, you sometimes need to handle exceptions at the top level.</p>
<p>153
00:21:46.000 --&gt; 00:22:04.660
Eyal Balla: So when you do that, you can register a manual exception handler: you get the main loop and set the exception handler, and you can handle the errors that are created from each of the tasks</p>
<p>154
00:22:06.276 --&gt; 00:22:16.647
Eyal Balla: separately, because if you don't do that, the asyncio context would</p>
<p>155
00:22:18.610 --&gt; 00:22:27.379
Eyal Balla: exit when one of the sub-coroutines throws an exception into the main context.</p>
<p>156
00:22:27.760 --&gt; 00:22:28.620
Eyal Balla: Okay?</p>
<p>157
00:22:28.810 --&gt; 00:22:34.353
Eyal Balla: So sometimes you want to wait, maybe for the last one, or</p>
<p>158
00:22:35.440 --&gt; 00:22:41.499
Eyal Balla: perhaps some other behavior that is specific for your system. And you can do this this way</p>
<p>159
00:22:42.607 --&gt; 00:22:54.359
Eyal Balla: Specifically for gather, you have two ways to handle exceptions. You can do it inside each of the coroutines, like I did before, or you can</p>
<p>160
00:22:54.610 --&gt; 00:23:16.869
Eyal Balla: ask gather to collect all the exceptions from each of the coroutines, and then you can handle errors together. For instance, if you want to have some retry mechanism, then this is a good way to do it. So you gather all the errors, and then you can retry all those that failed, or decide to do whatever you want with those that did not succeed.</p>
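<p>The second option is gather's return_exceptions flag; here is a minimal sketch of collecting the failures for a later retry pass:</p>

```python
import asyncio

async def work(n):
    if n == 2:
        raise ValueError(f"task {n} failed")
    await asyncio.sleep(0)
    return n

async def main():
    # with return_exceptions=True, gather does not abort on the first error;
    # exceptions come back in the results list alongside the successes
    results = await asyncio.gather(*(work(n) for n in range(4)),
                                   return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return successes, failures

successes, failures = asyncio.run(main())
```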
<p>161
00:23:19.144 --&gt; 00:23:29.915
Eyal Balla: Regarding testing. So I think, if you look at these coroutines, what I want to test here is maybe</p>
<p>162
00:23:32.830 --&gt; 00:23:40.820
Eyal Balla: the functional response. Okay, something like a happy path, and maybe an exception to test the raise_for_status.</p>
<p>163
00:23:42.300 --&gt; 00:23:50.960
Eyal Balla: So the important part is to mark your test as pytest.mark.asyncio. This allows you to run tests in an async context.</p>
<p>164
00:23:51.770 --&gt; 00:23:57.170
Eyal Balla: httpx gives an httpx mock, so you can use that.</p>
<p>165
00:23:57.300 --&gt; 00:24:04.819
Eyal Balla: And then you can inject the response here, for instance, and test your happy flow.</p>
<p>166
00:24:05.150 --&gt; 00:24:17.508
Eyal Balla: And also you can always use pytest.raises like before. And assuming you marked it as asyncio, you can test the async flow and the</p>
<p>167
00:24:18.130 --&gt; 00:24:19.460
Eyal Balla: exception flow.</p>
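<p>The talk's examples use pytest.mark.asyncio and httpx's mock; as a stdlib-only sketch of the same two tests (happy path plus raise_for_status), here is the equivalent with unittest's IsolatedAsyncioTestCase, where FakeClient and FakeResponse are hand-rolled stand-ins for an httpx.AsyncClient and its response:</p>

```python
import unittest

class FakeResponse:
    def __init__(self, status_code, data=None):
        self.status_code = status_code
        self.data = data or {}
    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

class FakeClient:
    def __init__(self, response):
        self._response = response
    async def get(self, url):
        return self._response

async def fetch_item(client, item_id):
    # the coroutine under test
    response = await client.get(f"/items/{item_id}")
    response.raise_for_status()
    return response.data

class FetchItemTests(unittest.IsolatedAsyncioTestCase):
    async def test_happy_path(self):
        client = FakeClient(FakeResponse(200, {"id": 1}))
        self.assertEqual(await fetch_item(client, 1), {"id": 1})

    async def test_error_raises(self):
        client = FakeClient(FakeResponse(500))
        with self.assertRaises(RuntimeError):
            await fetch_item(client, 1)
```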
<p>168
00:24:19.720 --&gt; 00:24:20.520
Eyal Balla: Okay?</p>
<p>169
00:24:22.605 --&gt; 00:24:23.380
Eyal Balla: Sorry.</p>
<p>170
00:24:24.240 --&gt; 00:24:31.040
Eyal Balla: What you can also do: there's AsyncMock, like unittest's MagicMock.</p>
<p>171
00:24:32.065 --&gt; 00:24:40.340
Eyal Balla: You can mock coroutines. So here's an example of how you mock a coroutine and test it. So</p>
<p>172
00:24:40.510 --&gt; 00:24:48.583
Eyal Balla: this is something that's nice to know, and I think it's very valuable when you're testing and mocking</p>
<p>173
00:24:50.250 --&gt; 00:24:51.110
Eyal Balla: coroutines</p>
<p>174
00:24:51.798 --&gt; 00:25:02.409
Eyal Balla: I think that today the default patch returns a MagicMock or an AsyncMock, according to the</p>
<p>175
00:25:02.938 --&gt; 00:25:09.211
Eyal Balla: type of function that it gets. So if this is a coroutine, then it would</p>
<p>176
00:25:10.050 --&gt; 00:25:17.320
Eyal Balla: create this as an AsyncMock, and if not, it'd be a MagicMock, according to what is needed.</p>
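<p>A short sketch of mocking a coroutine with unittest.mock.AsyncMock; the conn.fetch call imitates an asyncpg-style API and is purely illustrative:</p>

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

async def count_rows(conn):
    rows = await conn.fetch("SELECT 1")   # awaiting the mocked coroutine
    return len(rows)

conn = MagicMock()
# AsyncMock's return value is delivered through an awaitable,
# so count_rows can await it like a real driver call
conn.fetch = AsyncMock(return_value=[{"a": 1}, {"a": 2}])

result = asyncio.run(count_rows(conn))
```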
<p>177
00:25:18.980 --&gt; 00:25:24.740
Eyal Balla: Something else that's very important for developers is the ability to debug.</p>
<p>178
00:25:25.240 --&gt; 00:25:33.950
Eyal Balla: So asyncio gives a debug mode. When you run with the environment variable, you get</p>
<p>179
00:25:34.473 --&gt; 00:25:43.669
Eyal Balla: tracebacks on async functions when they're not awaited. So you can find out where this happens, and when.</p>
<p>180
00:25:44.410 --&gt; 00:25:57.470
Eyal Balla: And also this monitors thread safety. So when something in your system behaves unsafely regarding the different coroutines and the memory they touch,</p>
<p>181
00:25:57.900 --&gt; 00:26:08.020
Eyal Balla: you get errors in your logs. And also this helps debug slow coroutines, because</p>
<p>182
00:26:08.592 --&gt; 00:26:10.857
Eyal Balla: asyncio is very</p>
<p>183
00:26:12.057 --&gt; 00:26:24.449
Eyal Balla: sensitive to having long coroutines blocking short coroutines. So this actually helps you understand the flow of your code better once you use asyncio.</p>
<p>184
00:26:26.425 --&gt; 00:26:38.450
Eyal Balla: So this is what a slow log looks like. If I do something very slow, you'd get a log saying this has taken too long. Okay? So you would know that</p>
<p>185
00:26:38.890 --&gt; 00:26:40.749
Eyal Balla: you want to look at this function.</p>
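<p>A self-contained way to see that warning: turn on debug mode via asyncio.run(..., debug=True), lower slow_callback_duration, and block the loop on purpose (the 50 ms threshold and the 0.2 s sleep here are arbitrary choices for the demo):</p>

```python
import asyncio
import logging
import time

records = []

class Capture(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

# asyncio's slow-callback warnings go through the "asyncio" logger
logging.getLogger("asyncio").addHandler(Capture())
logging.getLogger("asyncio").setLevel(logging.WARNING)

async def blocking():
    time.sleep(0.2)   # blocks the event loop -- exactly what debug mode flags

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05   # warn on anything slower than 50 ms
    await blocking()

asyncio.run(main(), debug=True)
# records should now contain a warning along the lines of
# "Executing <Task ...> took ... seconds"
```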
<p>186
00:26:42.468 --&gt; 00:26:46.551
Eyal Balla: Also, something you might want to consider is</p>
<p>187
00:26:47.599 --&gt; 00:26:52.540
Eyal Balla: having something that's always running in your context, in your services.</p>
<p>188
00:26:53.396 --&gt; 00:27:01.890
Eyal Balla: So aiodebug allows you to log slow callbacks inside your production pods.</p>
<p>189
00:27:02.010 --&gt; 00:27:08.362
Eyal Balla: And with this you can enable specific</p>
<p>190
00:27:09.420 --&gt; 00:27:12.209
Eyal Balla: callbacks when this happens. And this is</p>
<p>191
00:27:12.340 --&gt; 00:27:14.906
Eyal Balla: really great, because this has</p>
<p>192
00:27:16.144 --&gt; 00:27:20.340
Eyal Balla: almost no performance impact on the actual services.</p>
<p>193
00:27:20.490 --&gt; 00:27:26.070
Eyal Balla: And it allows you to understand better how your code behaves in production.</p>
<p>194
00:27:27.820 --&gt; 00:27:28.650
Eyal Balla: Okay?</p>
<p>195
00:27:29.270 --&gt; 00:27:30.080
Eyal Balla: Great</p>
<p>196
00:27:31.981 --&gt; 00:27:46.289
Eyal Balla: Also, something you can do is monitor each of the different tasks. There's asyncio.all_tasks and</p>
<p>197
00:27:46.770 --&gt; 00:27:50.107
Eyal Balla: current_task. So you can run</p>
<p>198
00:27:51.380 --&gt; 00:27:56.220
Eyal Balla: a coroutine once in a while to understand what is running,</p>
<p>199
00:27:56.340 --&gt; 00:28:00.020
Eyal Balla: and get the stacks and understand the behavior.</p>
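<p>A small sketch using asyncio.all_tasks to list what is currently scheduled; the task names are made up:</p>

```python
import asyncio

async def background(delay):
    await asyncio.sleep(delay)

async def main():
    asyncio.create_task(background(0.05), name="job-a")
    asyncio.create_task(background(0.05), name="job-b")
    await asyncio.sleep(0)   # let the tasks get scheduled
    # all_tasks returns every unfinished task on the running loop,
    # including the task running this coroutine
    return sorted(task.get_name() for task in asyncio.all_tasks())

names = asyncio.run(main())
```

Each task also exposes get_stack() for the deeper inspection mentioned above.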
<p>200
00:28:00.690 --&gt; 00:28:01.500
Eyal Balla: Okay.</p>
<p>201
00:28:03.120 --&gt; 00:28:11.940
Eyal Balla: So this is about it. I went over the asyncio concurrent programming framework.</p>
<p>202
00:28:12.702 --&gt; 00:28:27.219
Eyal Balla: I think we saw a real world example and understood a bit how asyncio behaves, and why we'd want to use it. And also we looked at some debugging, testing, and exception handling tools.</p>
<p>203
00:28:27.810 --&gt; 00:28:28.950
Eyal Balla: and that's it.</p>
<p>204
00:28:31.170 --&gt; 00:28:32.000
Eyal Balla: Questions.</p>
<p>205
00:28:36.240 --&gt; 00:28:36.940
lapid: Can I?</p>
<p>206
00:28:39.040 --&gt; 00:28:40.060
lapid: Do you hear me?</p>
<p>207
00:28:40.820 --&gt; 00:28:41.869
Gabor Szabo: Yes, yes, we can hear you.</p>
<p>208
00:28:41.870 --&gt; 00:28:56.800
lapid: Oh, hi, yeah. So you touched a little bit on it. But when I'm doing something, like a project that develops into something a little bit bigger, I find myself sometimes I just get lost.</p>
<p>209
00:28:57.150 --&gt; 00:29:01.299
lapid: I can't verify myself that I actually</p>
<p>210
00:29:02.800 --&gt; 00:29:06.060
lapid: control all the coroutines properly, because many times</p>
<p>211
00:29:06.430 --&gt; 00:29:20.690
lapid: there are queues that feed one another, like I have some streaming, and then some queues and things like that. So you touched a little bit on how you monitor that, but can you expand a little bit on how you deal with that? Because I just,</p>
<p>212
00:29:21.140 --&gt; 00:29:29.730
lapid: afterwards I go back, and I just print constantly, and I check the timing, and I waste a lot of time on that, and I feel like maybe someone more experienced has a better solution.</p>
<p>213
00:29:31.030 --&gt; 00:29:34.775
Eyal Balla: So I think, when you,</p>
<p>214
00:29:35.460 --&gt; 00:29:41.375
Eyal Balla: I think that the first thing is to build your software,</p>
<p>215
00:29:42.340 --&gt; 00:29:44.319
Eyal Balla: even though it's async,</p>
<p>216
00:29:44.440 --&gt; 00:29:59.850
Eyal Balla: like a top-down architecture, understanding which parts are calling which other parts, and making sure you synchronize correctly. Once you do that, things are easier, I think.</p>
<p>217
00:30:01.503 --&gt; 00:30:12.799
Eyal Balla: So, like other considerations in software development, you need to have a solid design</p>
<p>218
00:30:13.250 --&gt; 00:30:17.059
Eyal Balla: at the beginning, right? And then you can use</p>
<p>219
00:30:17.707 --&gt; 00:30:21.539
Eyal Balla: something like the task monitor right here.</p>
<p>220
00:30:21.650 --&gt; 00:30:26.649
Eyal Balla: So you can add this as something that you can call within your code.</p>
<p>221
00:30:27.310 --&gt; 00:30:35.309
Eyal Balla: And this actually helps you understand the different coroutines that run at the same time,</p>
<p>222
00:30:35.530 --&gt; 00:30:57.230
Eyal Balla: and can help you, together with the slow coroutine logs, understand the impact of each of the different coroutines running. And I think that when you say you want to understand, you have some kind of a problem, right? Maybe something that doesn't get the chance to run at all,</p>
<p>223
00:30:57.380 --&gt; 00:31:07.317
Eyal Balla: and you don't know why. The reason for this is probably that something is blocking the main loop, right? It's too long. So you'd get</p>
<p>224
00:31:08.220 --&gt; 00:31:17.930
Eyal Balla: messages on the slow callbacks, and then you would see this in the running tasks and understand the context of how it ran.</p>
<p>225
00:31:19.290 --&gt; 00:31:21.089
Eyal Balla: So, does this make sense?</p>
<p>226
00:31:21.380 --&gt; 00:31:46.119
lapid: Yeah, something in that area. It's more that when the project gets big enough, you know, I have design patterns for code that I know I follow. That helps me, every time I come back to code that I didn't touch for a while, to know, okay, this is what I do in order to actually add a new feature. But somehow, when I develop with asyncio,</p>
<p>227
00:31:46.430 --&gt; 00:32:05.980
lapid: many times, if I want to change something in the future, I find myself having to go very deep into the code. Maybe I just don't know how to do a design pattern well, but from my experience, just the fact that</p>
<p>228
00:32:06.150 --&gt; 00:32:15.399
lapid: changing the behavior from something synchronous to asynchronous has forced me to change my code way deeper than I wanted.</p>
<p>229
00:32:15.650 --&gt; 00:32:22.590
lapid: So this is what I'm actually curious about. This is the pain point I experienced.</p>
<p>230
00:32:23.800 --&gt; 00:32:28.240
lapid: did it? Didn't like next like, Are you? Are you?</p>
<p>231
00:32:28.430 --&gt; 00:32:40.939
lapid: Something changing in the future. I want to add something, and now, let's say I'm scraping some information and enriching it, and I want to do it in parallel.</p>
<p>232
00:32:41.100 --&gt; 00:32:44.689
lapid: But I have an existing project that</p>
<p>233
00:32:44.840 --&gt; 00:32:54.990
lapid: so far didn't assume anything has to be in parallel, because I had a different data source that I used before, and it was way, way faster. So now,</p>
<p>234
00:32:54.990 --&gt; 00:32:55.840
lapid: so.</p>
<p>235
00:32:56.690 --&gt; 00:33:02.479
Eyal Balla: So I think what you would do is you would add maybe an asyncio context to</p>
<p>236
00:33:02.620 --&gt; 00:33:06.920
Eyal Balla: part of the code right, and then</p>
<p>237
00:33:07.310 --&gt; 00:33:14.299
Eyal Balla: run it, maybe, with asyncio.run, and the rest would remain synchronous.</p>
<p>238
00:33:14.780 --&gt; 00:33:20.479
Eyal Balla: So you can limit the extent of what you're touching,</p>
<p>239
00:33:21.090 --&gt; 00:33:27.019
Eyal Balla: and also, as always, be sure to test the specific part</p>
<p>240
00:33:27.350 --&gt; 00:33:32.620
Eyal Balla: as a different, like a different library that you're calling,</p>
<p>241
00:33:33.050 --&gt; 00:33:35.920
Eyal Balla: and treat it like one,</p>
<p>242
00:33:36.190 --&gt; 00:33:44.869
Eyal Balla: like a different code component, a different part of the code, and put it somewhere that's self-contained,</p>
<p>243
00:33:46.018 --&gt; 00:33:48.710
Eyal Balla: and maybe that can help.</p>
<p>244
00:33:49.990 --&gt; 00:34:00.119
lapid: Yeah. So what you're describing is how I solved it. But actually I was asking myself, if I had to go over the project again,</p>
<p>245
00:34:00.440 --&gt; 00:34:06.849
lapid: saying, oh, maybe in the future I will have some asynchronous part,</p>
<p>246
00:34:07.720 --&gt; 00:34:13.109
lapid: would I want to actually prepare my code for the possibility of something running async in the future?</p>
<p>247
00:34:13.980 --&gt; 00:34:17.570
Eyal Balla: So I can tell you that we had.</p>
<p>248
00:34:18.586 --&gt; 00:34:23.349
Eyal Balla: we needed to move from synchronous code to Async code in our company.</p>
<p>249
00:34:23.860 --&gt; 00:34:25.130
Eyal Balla: And this is a</p>
<p>250
00:34:25.620 --&gt; 00:34:31.530
Eyal Balla: this is quite a big migration, because, as I described in the beginning of the talk,</p>
<p>251
00:34:31.690 --&gt; 00:34:39.639
Eyal Balla: using asyncio is something very different from the design of a synchronous program.</p>
<p>252
00:34:40.050 --&gt; 00:34:43.910
Eyal Balla: So I don't think I have like a</p>
<p>253
00:34:44.429 --&gt; 00:34:56.759
Eyal Balla: anything that I can say like, well, if you write a synchronous program and you want to prepare, do this and that, because I think that you need to look at writing async code and sync code in very different ways.</p>
<p>254
00:34:57.500 --&gt; 00:35:01.580
lapid: Okay. So it sounds like you went through the same problems I had. So.</p>
<p>255
00:35:02.470 --&gt; 00:35:02.940
Eyal Balla: Yeah.</p>
<p>256
00:35:02.940 --&gt; 00:35:04.480
lapid: At least we suffer together.</p>
<p>257
00:35:06.480 --&gt; 00:35:08.990
Eyal Balla: Sharing suffering is always good.</p>
<p>258
00:35:08.990 --&gt; 00:35:11.800
lapid: Yeah, yeah, thank you.</p>
<p>259
00:35:12.560 --&gt; 00:35:13.260
Eyal Balla: You're welcome.</p>
<p>260
00:35:15.480 --&gt; 00:35:16.050
Eyal Balla: Anything else.</p>
<p>261
00:35:16.050 --&gt; 00:35:21.090
Naty Harary: Yeah, I have a question. I'm using a lot of third party</p>
<p>262
00:35:21.766 --&gt; 00:35:39.670
Naty Harary: libraries. FastAPI, SQLAlchemy, things like that. And they sometimes hide the asyncio implementation, and I always wondered, because I just believe it works well. Is there any way to query the event loop, so I know,</p>
<p>263
00:35:39.910 --&gt; 00:35:43.440
Naty Harary: like, what's running right now?</p>
<p>264
00:35:43.810 --&gt; 00:35:48.510
Naty Harary: Is it even possible? Is that something that python hides from us completely?</p>
<p>265
00:35:48.690 --&gt; 00:35:56.280
Eyal Balla: So this is a way to query all the tasks that are running in asyncio.</p>
<p>266
00:35:56.680 --&gt; 00:35:57.110
Naty Harary: All right.</p>
<p>267
00:35:58.330 --&gt; 00:36:06.459
Eyal Balla: And also there's a library I did not talk about here, and it's called aiomonitor. So you can look into that, too.</p>
<p>268
00:36:07.030 --&gt; 00:36:08.999
Eyal Balla: And it's very nice.</p>
<p>269
00:36:09.965 --&gt; 00:36:11.860
Eyal Balla: So you can try that, too.</p>
<p>270
00:36:12.550 --&gt; 00:36:15.994
Naty Harary: Alright, cool, because you talked about the timing, so I didn't know if</p>
<p>271
00:36:16.470 --&gt; 00:36:19.789
Naty Harary: there are other concerns, but aiomonitor too, that's cool.</p>
<p>272
00:36:19.930 --&gt; 00:36:20.890
Naty Harary: We'll check it.</p>
<p>273
00:36:22.530 --&gt; 00:36:23.380
Naty Harary: Thank you.</p>
<p>274
00:36:27.600 --&gt; 00:36:29.680
Eyal Balla: Okay, so I think we're done.</p>
<p>275
00:36:30.493 --&gt; 00:36:46.420
Eyal Balla: Thank you guys for listening. You can reach me at this email or LinkedIn. And also there's a GitHub project with this presentation and all the code samples together.</p>
<p>276
00:36:47.370 --&gt; 00:36:51.650
Eyal Balla: I think I sent it to Gabor last time, but I can send it again,</p>
<p>277
00:36:51.880 --&gt; 00:36:53.749
Eyal Balla: and he can spread it out.</p>
<p>278
00:36:53.970 --&gt; 00:36:58.322
Gabor Szabo: Yes, that would be a good idea. I'm going to include it in, there is this</p>
<p>279
00:36:58.740 --&gt; 00:37:09.540
Gabor Szabo: web page about the presentation, which will be linked from the video. And then on that page, I'll include these links as well.</p>
<p>280
00:37:09.810 --&gt; 00:37:12.380
Gabor Szabo: So your LinkedIn, and your,</p>
<p>281
00:37:12.870 --&gt; 00:37:15.660
Gabor Szabo: and the link to that GitHub page.</p>
<p>282
00:37:15.950 --&gt; 00:37:16.730
Gabor Szabo: We'll get to.</p>
<p>283
00:37:16.730 --&gt; 00:37:17.390
Eyal Balla: Okay. Great.</p>
<p>284
00:37:17.390 --&gt; 00:37:18.530
Gabor Szabo: Repository.</p>
<p>285
00:37:18.820 --&gt; 00:37:19.959
lapid: It's a nice, clear.</p>
<p>286
00:37:19.960 --&gt; 00:37:22.739
Gabor Szabo: No more questions. Then. Thank you very much.</p>
<p>287
00:37:23.530 --&gt; 00:37:27.019
Gabor Szabo: Yeah, thank you. Everyone for participating. And.</p>
<p>288
00:37:27.020 --&gt; 00:37:28.569
Eyal Balla: I think there's 1 more question.</p>
<p>289
00:37:28.570 --&gt; 00:37:31.890
lapid: Oh, if I can ask chitchat questions.</p>
<p>290
00:37:32.120 --&gt; 00:37:32.950
Gabor Szabo: You're good. Go ahead.</p>
<p>291
00:37:32.950 --&gt; 00:37:37.289
lapid: So you said you have a company. What does your company do?</p>
<p>292
00:37:37.430 --&gt; 00:37:39.700
lapid: Can you elaborate a little bit more?</p>
<p>293
00:37:40.840 --&gt; 00:37:43.977
Eyal Balla: Sure. So scenario, we do.</p>
<p>294
00:37:45.833 --&gt; 00:37:55.216
Eyal Balla: we do security for healthcare. So we give hospitals tools to understand their security posture and</p>
<p>295
00:37:56.927 --&gt; 00:38:03.010
Eyal Balla: do attack detection. So we detect malicious content and attacks on hospitals.</p>
<p>296
00:38:04.950 --&gt; 00:38:06.262
Eyal Balla: And I think,</p>
<p>297
00:38:07.670 --&gt; 00:38:16.929
Eyal Balla: because hospitals are very sensitive. So we need to handle like a very high scale, we do with with passive network inspection.</p>
<p>298
00:38:17.150 --&gt; 00:38:19.275
Eyal Balla: And so we handle like,</p>
<p>299
00:38:20.770 --&gt; 00:38:30.910
Eyal Balla: quite a lot of information in our cloud. So we need to use tools and also using asyncaio helps us</p>
<p>300
00:38:31.040 --&gt; 00:38:34.540
Eyal Balla: scale out and handle things as we need.</p>
<p>301
00:38:35.900 --&gt; 00:38:37.260
Eyal Balla: I hope that.</p>
<p>302
00:38:38.536 --&gt; 00:38:39.730
lapid: Get answered.</p>
<p>303
00:38:40.010 --&gt; 00:38:46.181
lapid: Yeah, no, I'm just curious. Like, it's very, it's very far from my, my expertise. I'm a, I'm a data scientist. And</p>
<p>304
00:38:46.960 --&gt; 00:38:54.960
lapid: I came across async when some of my projects needed some boost.</p>
<p>305
00:38:55.770 --&gt; 00:39:01.929
lapid: Are you looking also for data scientists? I'm not available now, but in about a month or 2.</p>
<p>306
00:39:03.532 --&gt; 00:39:14.997
Eyal Balla: So I think data scientist is not something that we're currently looking for. But you, you could actually look at the company career page. There are several positions, and</p>
<p>307
00:39:16.130 --&gt; 00:39:19.450
Eyal Balla: we're expanding, and it's it's a good time, I think.</p>
<p>308
00:39:20.570 --&gt; 00:39:22.509
Eyal Balla: Alright. Then brush.</p>
<p>309
00:39:24.170 --&gt; 00:39:24.860
lapid: Alright. Thank you.</p>
<p>310
00:39:26.120 --&gt; 00:39:32.120
Gabor Szabo: So. So thank you. Thank you very much. Thank you, Eyal, for the presentation, and for all the questions, people,</p>
<p>311
00:39:32.430 --&gt; 00:39:38.840
Gabor Szabo: and everyone who was watching, please again, like the video, as I told you, and follow the Channel.</p>
<p>312
00:39:38.980 --&gt; 00:39:46.519
Gabor Szabo: And if you would like to present at one of our meetings, then please get in touch with me.</p>
<p>313
00:39:46.790 --&gt; 00:39:53.759
Gabor Szabo: I would be happy to to provide the the place for people to to share their their knowledge.</p>
<p>314
00:39:54.540 --&gt; 00:39:55.679
Gabor Szabo: Thank you very much.</p>
<p>315
00:39:56.000 --&gt; 00:39:56.850
Gabor Szabo: Goodbye.</p>
<p>316
00:39:56.850 --&gt; 00:39:57.250
Eyal Balla: Bye-bye.</p>
<p>317
00:39:57.250 --&gt; 00:39:57.590
Dmitry Morgovsky: Sure.</p>
<p>318
00:39:57.590 --&gt; 00:39:58.919
lapid: Bye-bye. Thank you.</p>
<p>319
00:39:59.980 --&gt; 00:40:01.209
Shalaka Deshan: Thank you, anyway.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>The Evolution of Python Monitoring with May Walter</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-02-06T08:30:01Z</updated>
    <pubDate>2025-02-06T08:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/the-evolution-of-python-monitoring" />
    <id>https://python.code-maven.com/the-evolution-of-python-monitoring</id>
    <content type="html"><![CDATA[<p>In this lecture, we will explore the evolution of Python monitoring over the years, covering tools and techniques from sys.monitoring to import hooks, highlighting advancements and best practices in keeping your Python code in check.</p>
<p>Join us and time-travel across the evolution of Python monitoring mechanisms. We'll delve into history from dedicated tools like sys.monitoring to more advanced techniques such as ceval and import hooks. This session will provide a comprehensive overview of how monitoring practices have developed over the years, offering insights into the best practices for maintaining and debugging your Python code and the pros and cons of each approach. Whether you're a seasoned developer or new to Python, you'll gain valuable knowledge on how to keep your code running smoothly and efficiently without hurting performance or your dev velocity with tedious maintenance.</p>
<p><img src="images/may-walter.jpeg" alt="May Walter" /></p>
<p><a href="https://www.linkedin.com/in/may-walterr/">May Walter</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/XIqoYYWaWFI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:00.720 --&gt; 00:00:02.690
Haki Benita: This meeting is being recorded.</p>
<p>2
00:00:03.400 --&gt; 00:00:04.320
Gabor Szabo: Okay.</p>
<p>3
00:00:05.800 --&gt; 00:00:12.250
Gabor Szabo: yeah. So hi, and welcome to the Python Maven, let's call it Python Maven. This is the Code Maven</p>
<p>4
00:00:12.500 --&gt; 00:00:41.910
Gabor Szabo: YouTube channel. And we are organizing these meetings in the Codebay Events group, but sort of it has 3 separate sessions, and this is going to be the, the Python-specific one. My name is Gabor Szabo. I usually teach Python and Rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with, with each other.</p>
<p>5
00:00:42.270 --&gt; 00:00:46.010
Gabor Szabo: You're welcome. I'm really happy that you're here</p>
<p>6
00:00:46.140 --&gt; 00:01:04.909
Gabor Szabo: in this session, listening, as I mentioned earlier, you're welcome to to comment or use the chat and ask questions. And if you're just watching the video recorded on Youtube, then please remember to like the video and follow the channel.</p>
<p>7
00:01:05.080 --&gt; 00:01:11.990
Gabor Szabo: and let's welcome Haki now, and let him introduce you. Introduce yourself and and</p>
<p>8
00:01:12.700 --&gt; 00:01:17.579
Gabor Szabo: and give the presentation. So thank you for accepting the invitation.</p>
<p>9
00:01:18.970 --&gt; 00:01:31.149
Haki Benita: Thank you. Thank you, Gabor. First of all, I like the fact that we have this intimate group, that we can freely talk. I actually encourage you to consider opening the mics.</p>
<p>10
00:01:31.210 --&gt; 00:02:01.090
Haki Benita: Because I think we can actually have a conversation throughout the presentation. I like to give interactive presentations. Your call, you're the boss. And just a quick introduction about the subject and about myself. So we are going to talk about how to make your backend roar. And I want to start by apologizing for the tacky headline. But unfortunately, these types of tacky headlines do work. Believe it or not.</p>
<p>11
00:02:01.610 --&gt; 00:02:09.010
Haki Benita: So. My name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team</p>
<p>12
00:02:09.289 --&gt; 00:02:18.949
Haki Benita: of developers working on a very large ticketing platform in Israel, serving about one and a half unique</p>
<p>13
00:02:19.580 --&gt; 00:02:32.470
Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about python performance and databases. And you can find my stuff on my website.</p>
<p>14
00:02:33.110 --&gt; 00:02:47.839
Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them</p>
<p>15
00:02:47.850 --&gt; 00:03:14.629
Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about Python, I'm going to use Django and the Django ORM. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck, they're not even specific to Python. But this is a good environment to explain the concepts with.</p>
<p>16
00:03:15.390 --&gt; 00:03:19.889
Haki Benita: So what is a URL shortener? You probably know about</p>
<p>17
00:03:19.900 --&gt; 00:03:39.330
Haki Benita: other types of URL shorteners. You have Bitly, you have the late goo.gl, Buffer, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that</p>
<p>18
00:03:39.330 --&gt; 00:04:02.240
Haki Benita: First, if you are operating in text-constrained environments like SMS messages or tweets, you might want to share a very large link. So you want to make it shorter, so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,</p>
<p>19
00:04:02.240 --&gt; 00:04:16.500
Haki Benita: the URL shortener redirects to the long URL, and keeps track of how many people click that link. So if you have something like a campaign that you want to launch, and you want to keep track of how many people clicked your link.</p>
<p>20
00:04:16.820 --&gt; 00:04:20.149
Haki Benita: This is what you would use a URL shortener for</p>
<p>21
00:04:20.310 --&gt; 00:04:48.240
Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model short URL, we have an Id column which is the primary key. It's just an auto incrementing integer field. We have the key. That's a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.</p>
<p>22
00:04:48.500 --&gt; 00:05:07.030
Haki Benita: We then have the URL, which is the long URL, we want to redirect to. We also want to keep track of when the URL was created. We do that using the created at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column</p>
<p>23
00:05:07.180 --&gt; 00:05:08.110
Haki Benita: at the bottom.</p>
<p>24
00:05:08.960 --&gt; 00:05:19.650
Haki Benita: So for our demonstration, so we actually have something to work with, I loaded 1 million short URLs into the table. Okay, now, this is not a lot. But we are going to see, some</p>
<p>25
00:05:20.700 --&gt; 00:05:25.929
Haki Benita: performance gains with just 1 million rows. Okay.</p>
<p>26
00:05:26.810 --&gt; 00:05:33.380
Haki Benita: so this talk is about Python. But it's essentially about SQL, so</p>
<p>27
00:05:33.510 --&gt; 00:05:54.859
Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing short URL filter on a specific key, .query. And I can actually get Django to print</p>
<p>28
00:05:55.190 --&gt; 00:05:59.549
Haki Benita: the SQL that it generated for this queryset, right.</p>
<p>29
00:06:00.040 --&gt; 00:06:26.740
Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query. Right? So I can do that by executing the function explain(). This translates into an EXPLAIN query command in SQL. And what I get in return is not the results of the query, but the execution plan, which is how the database is planning</p>
<p>30
00:06:26.930 --&gt; 00:06:30.979
Haki Benita: to execute my query. Now, when we just use, explain</p>
<p>31
00:06:31.200 --&gt; 00:06:36.260
Haki Benita: the database doesn't actually execute the query. It just produces a plan</p>
<p>32
00:06:36.370 --&gt; 00:06:53.839
Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance. It can be useful to produce the execution plan, but also have the database, execute this query and return some useful execution data. For that we can use a slightly different variation of the explain command.</p>
<p>33
00:06:53.970 --&gt; 00:07:13.319
Haki Benita: which is EXPLAIN ANALYZE. In Django you can do that by using explain(analyze=True). In SQL, Postgres specifically, you can do EXPLAIN (ANALYZE ON, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.</p>
<p>34
00:07:13.350 --&gt; 00:07:27.339
Haki Benita: First, because the database actually executed the query, you can see at the bottom that we get how long it took the database to produce an execution plan. In this case that would be 0.140 ms,</p>
<p>35
00:07:27.710 --&gt; 00:07:38.510
Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046. Okay.</p>
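For readers following along at home, the EXPLAIN idea can be tried with the Python standard library alone. This is a minimal sketch using SQLite's EXPLAIN QUERY PLAN as a stand-in for the Postgres EXPLAIN discussed in the talk; the short_url table here is a toy reconstruction of the talk's model, not the actual schema.

```python
import sqlite3

# In-memory database standing in for the talk's Postgres setup.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT UNIQUE, url TEXT)"
)

# EXPLAIN QUERY PLAN returns the plan instead of the results,
# analogous to plain EXPLAIN in Postgres: nothing is executed.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM short_url WHERE key = ?", ("abc123",)
).fetchall()
for row in plan:
    # The last column of each row is a human-readable plan step, e.g. a
    # search using the automatic index on the unique key column.
    print(row[3])
```

SQLite has no timing output comparable to Postgres's EXPLAIN (ANALYZE ON, TIMING ON), so the timing part of the discussion is Postgres-specific.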
<p>36
00:07:39.430 --&gt; 00:07:47.120
Haki Benita: Now, in addition to the timing. I'm also getting a very, very interesting piece of information inside the execution plan.</p>
<p>37
00:07:47.260 --&gt; 00:07:53.699
Haki Benita: Okay, what I get is the estimated cost and the actual cost</p>
<p>38
00:07:53.820 --&gt; 00:07:58.059
Haki Benita: that the database encountered while executing the query. So</p>
<p>39
00:07:59.010 --&gt; 00:08:15.400
Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.</p>
<p>40
00:08:16.100 --&gt; 00:08:17.350
Haki Benita: Finally.</p>
<p>41
00:08:17.990 --&gt; 00:08:28.419
Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever the database, whenever Django executes a query.</p>
<p>42
00:08:29.040 --&gt; 00:08:32.620
Haki Benita: it logs the SQL that was produced by the ORM.</p>
<p>43
00:08:33.510 --&gt; 00:08:34.475
Haki Benita: So</p>
<p>44
00:08:35.700 --&gt; 00:09:05.329
Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's look up the URL to redirect to by a key. So a user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.</p>
<p>45
00:09:05.440 --&gt; 00:09:27.109
Haki Benita: So if we want to implement this very, very simple function, we can do something like that. def resolve, okay, that's the name of the function. We want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get a short URL for this key. If we don't find anything we return None. Otherwise we return the URL to redirect to</p>
<p>46
00:09:27.110 --&gt; 00:09:37.730
Haki Benita: okay. Now we want to look at the SQL that Django generated for this function. Right? So we execute this function on some random key</p>
<p>47
00:09:37.950 --&gt; 00:09:57.950
Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetched everything from the short URL table for the key that we asked for, right? SELECT * FROM short_url WHERE key equals something.</p>
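The resolve function described here can be sketched in plain Python. This uses the stdlib sqlite3 module instead of the Django ORM, and the table and test data are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy version of the talk's model: id, unique key, long url, created_at, hits.
con.execute(
    "CREATE TABLE short_url ("
    " id INTEGER PRIMARY KEY,"
    " key TEXT UNIQUE NOT NULL,"
    " url TEXT NOT NULL,"
    " created_at TEXT,"
    " hits INTEGER NOT NULL DEFAULT 0)"
)
con.execute(
    "INSERT INTO short_url (key, url) VALUES (?, ?)",
    ("abc123", "https://example.com/a-very-long-url"),
)

def resolve(key):
    # Look up one row by its unique key and return the long URL,
    # or None when the key is unknown -- the same shape as the
    # Django version in the talk.
    row = con.execute(
        "SELECT url FROM short_url WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

print(resolve("abc123"))   # https://example.com/a-very-long-url
print(resolve("missing"))  # None
```

Note this sketch already selects only url; the Django version shown in the talk starts with SELECT * and is narrowed down a few steps later.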
<p>48
00:09:58.270 --&gt; 00:10:05.050
Haki Benita: If we want to look at how Postgres is actually executing this query,</p>
<p>49
00:10:05.210 --&gt; 00:10:12.719
Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan</p>
<p>50
00:10:13.535 --&gt; 00:10:20.159
Haki Benita: on the index we have on the key column. Okay, now.</p>
<p>51
00:10:21.180 --&gt; 00:10:28.839
Haki Benita: to understand what exactly an index scan means, let's take a second to talk about the B-tree index.</p>
<p>52
00:10:29.040 --&gt; 00:10:42.120
Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?</p>
<p>53
00:10:42.560 --&gt; 00:11:11.160
Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, one through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table.</p>
<p>54
00:11:11.400 --&gt; 00:11:15.460
Haki Benita: That store rows with these values. Okay.</p>
<p>55
00:11:16.290 --&gt; 00:11:28.179
Haki Benita: now, above the leaves, we have branches and a root block that act as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.</p>
<p>56
00:11:28.180 --&gt; 00:11:38.290
Gabor Szabo: Sorry. Just someone says that they don't see the slides. So I just wanted to... And I'm unsure if the other people do see the slides. So if</p>
<p>57
00:11:38.670 --&gt; 00:11:53.529
Gabor Szabo: I asked it in the chat, but no one answered. So I hope that other people... okay, so some other people see it. So my recommendation to Eduardo is to turn, maybe, on and off the... I mean, maybe exit Zoom and enter Zoom again. Sorry for the.</p>
<p>58
00:11:53.530 --&gt; 00:11:54.940
Haki Benita: Okay, no problem.</p>
<p>59
00:11:55.120 --&gt; 00:11:56.160
Haki Benita: Yeah.</p>
<p>60
00:11:56.400 --&gt; 00:11:59.700
Haki Benita: Okay, okay, so let's</p>
<p>61
00:12:01.690 --&gt; 00:12:31.100
Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4. It's not a match.</p>
<p>62
00:12:31.100 --&gt; 00:12:36.150
Haki Benita: The next value is 5. That's a match, and now we can</p>
<p>63
00:12:36.150 --&gt; 00:12:47.970
Haki Benita: scan. We can follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay, now.</p>
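The root-to-leaf walk described above can be simulated in a few lines of Python. The node layout below is a deliberately simplified stand-in for a real B-tree (Postgres's on-disk format is far more involved), and the "TIDs" are just strings naming table rows.

```python
# Simplified B-tree: leaves hold sorted keys with "TIDs" (row pointers);
# the root maps separator keys to child leaves, as in the slide's example.
leaves = [
    {"keys": [1, 2, 3], "tids": {1: "row1", 2: "row2", 3: "row3"}},
    {"keys": [4, 5, 6], "tids": {4: "row4", 5: "row5", 6: "row6"}},
    {"keys": [7, 8, 9], "tids": {7: "row7", 8: "row8", 9: "row9"}},
]
# Root entries: (smallest key in child leaf, child index).
root = [(1, 0), (4, 1), (7, 2)]

def search(value):
    # Scan the root left to right to find the child whose key range
    # could contain the value (the last separator <= value).
    child = 0
    for sep, idx in root:
        if value >= sep:
            child = idx
    leaf = leaves[child]
    # Scan the leaf left to right for an exact match.
    for key in leaf["keys"]:
        if key == value:
            return leaf["tids"][key]  # follow the pointer to the table row
    return None

print(search(5))   # row5 -- found via the middle leaf, as in the walkthrough
print(search(10))  # None
```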
<p>64
00:12:48.310 --&gt; 00:13:15.100
Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve</p>
<p>65
00:13:16.290 --&gt; 00:13:27.129
Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the id. I already have the key, right, and I don't care about the hit counter at this point</p>
<p>66
00:13:27.610 --&gt; 00:13:30.209
Haki Benita: right? So I don't care about all these fields. So</p>
<p>67
00:13:30.770 --&gt; 00:13:55.089
Haki Benita: one thing that we can do is, instead of fetching all of these fields, how about if we just fetch what we actually need, right? So in Django, we can do that by adding values_list('url'). Now the function is slightly different. But if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the URL. So this is exactly what we need.</p>
<p>68
00:13:55.200 --&gt; 00:14:10.249
Haki Benita: If we look at this execution plan once again for this query, we can see that, again, Postgres is using an index scan on the unique index that we have on the key. Right? So now.</p>
<p>69
00:14:10.920 --&gt; 00:14:30.719
Haki Benita: once we found a matching row, we can follow the pointer to the table. We can get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block. Right? So that's 1 read. Then I need to follow the branch all the way to the leaf. Let's say that we have just.</p>
<p>70
00:14:30.730 --&gt; 00:14:41.789
Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique</p>
<p>71
00:14:41.970 --&gt; 00:14:52.020
Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.</p>
<p>72
00:14:53.290 --&gt; 00:15:03.019
Haki Benita: this query is executed a lot. This is basically what our system is doing, right. It's getting keys and resolving them to URLs to redirect, right</p>
<p>73
00:15:03.360 --&gt; 00:15:17.979
Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.</p>
<p>74
00:15:17.980 --&gt; 00:15:34.249
Haki Benita: What if, instead of going to the table to get the URL, what if I could include the URL in the leaf block in the index? This way, when I find a matching entry in the leaf block, I would have the URL just sitting there.</p>
<p>75
00:15:34.310 --&gt; 00:15:52.420
Haki Benita: Right? So this mind-blowing idea is called inclusive index. Okay, in other databases it's called covering index or inclusive indexes, and what it allows us to do, it allows us to store additional information in the leaf block.</p>
<p>76
00:15:52.500 --&gt; 00:16:14.569
Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look, the key is indexed. The URL is not indexed. It's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.</p>
<p>77
00:16:15.500 --&gt; 00:16:21.569
Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is</p>
<p>78
00:16:21.900 --&gt; 00:16:33.889
Haki Benita: very, very subtle difference here, if you notice. Previously we had an index scan using our unique index. This time we have an index-only scan.</p>
<p>79
00:16:34.020 --&gt; 00:17:03.620
Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read</p>
<p>80
00:17:03.670 --&gt; 00:17:05.849
Haki Benita: 2 blocks from disk.</p>
<p>81
00:17:06.150 --&gt; 00:17:17.110
Haki Benita: Okay, the way to identify this is by the operator in the plan, right? So we have an index scan, and we have an index-only scan.</p>
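The index-only scan can also be reproduced with the standard library. Postgres gets it via the INCLUDE clause (the include argument on Django's unique constraint, as described in the talk); SQLite has no INCLUDE, so this sketch uses a composite index on (key, url) to get the same effect, which SQLite reports as a covering index. Table and index names are made up for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT, hits INTEGER)"
)
# key is the searched column; url rides along in the index entries,
# so a query that only needs url never has to touch the table itself.
con.execute("CREATE UNIQUE INDEX short_url_key_url_ix ON short_url (key, url)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = ?", ("abc123",)
).fetchall()
# SQLite flags the table-free lookup as a "COVERING INDEX" search,
# the counterpart of Postgres's "Index Only Scan".
print(plan[0][3])
```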
<p>82
00:17:18.170 --&gt; 00:17:39.170
Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?</p>
<p>83
00:17:39.170 --&gt; 00:17:49.959
Haki Benita: So if you have a very big big piece of like information like URL can be very, very big. So basically, I'm now storing the URL</p>
<p>84
00:17:50.140 --&gt; 00:18:09.440
Haki Benita: twice. So the index could get very, very big. I'm actually not a big fan of inclusive indexes. But I can think of 2 scenarios where it might be a good idea. First, if you have very wide tables. Imagine, like, data-warehouse type of tables, denormalized tables.</p>
<p>85
00:18:09.600 --&gt; 00:18:11.520
Haki Benita: and you have a very</p>
<p>86
00:18:12.250 --&gt; 00:18:22.290
Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider using</p>
<p>87
00:18:23.440 --&gt; 00:18:50.249
Haki Benita: an inclusive index. And also, I personally found that non-unique composite indexes can be good candidates for inclusive indexes, that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching from just a composite index to an inclusive index. Okay, questions so far before we move on to the next use case.</p>
<p>88
00:18:55.710 --&gt; 00:19:02.210
Haki Benita: Okay, if you have any questions, feel free, let's move on to the next to the next use case.</p>
<p>89
00:19:02.800 --&gt; 00:19:04.080
Haki Benita: So now</p>
<p>90
00:19:04.230 --&gt; 00:19:16.229
Haki Benita: we want to find unused keys, right? We have this business question. We want to know how many short URLs we have with no hits at all. Okay, we have 0 hits.</p>
<p>91
00:19:17.070 --&gt; 00:19:23.050
Haki Benita: So we start by implementing this very, very simple function. We call it find_unused_keys.</p>
<p>92
00:19:23.350 --&gt; 00:19:26.190
Haki Benita: and it returns a query set where</p>
<p>93
00:19:26.790 --&gt; 00:19:43.480
Haki Benita: with short URLs where hits equals 0. Once again, if we want to see what the query looks like, we can print the result of .query. We can see that it returns SELECT * FROM short_url WHERE hits = 0.</p>
<p>94
00:19:44.560 --&gt; 00:19:58.929
Haki Benita: Once again, through the process, we produce an execution plan. This time we can see that Postgres is doing a sequential scan on short URL. A sequential scan is basically a full table scan. Postgres is just</p>
<p>95
00:19:59.010 --&gt; 00:20:18.369
Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.</p>
<p>96
00:20:18.450 --&gt; 00:20:48.250
Haki Benita: So if you go to like 99% of developers and DBAs, they will tell you what's the problem, and just slap a B-tree on it. Right. So we add a B-tree index on the hits column. We do that in Django using db_index=True. We generate a migration. We apply the migration. We once again produce the execution plan with analyze, and lo and behold.</p>
<p>97
00:20:48.310 --&gt; 00:20:56.180
Haki Benita: Postgres is using our index, short_url_hits_ix. And, as you can see, the execution time</p>
<p>98
00:20:56.810 --&gt; 00:21:02.370
Haki Benita: is very, very fast compared to before, so we're done right.</p>
<p>99
00:21:03.230 --&gt; 00:21:06.060
Haki Benita: We can call it a day, we can go for lunch.</p>
<p>100
00:21:06.330 --&gt; 00:21:08.609
Haki Benita: We're happy. It's fast. Now</p>
<p>101
00:21:09.310 --&gt; 00:21:20.299
Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about</p>
<p>102
00:21:20.380 --&gt; 00:21:37.639
Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.</p>
<p>103
00:21:37.640 --&gt; 00:21:53.420
Haki Benita: Now, this batch processing job runs at the middle of the night, where you have very, very little users, and it runs very, very fast. It takes like this batch processing job like 10 seconds to complete. You're so happy, so fast. However.</p>
<p>104
00:21:53.720 --&gt; 00:22:05.569
Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that</p>
<p>105
00:22:06.440 --&gt; 00:22:12.950
Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute</p>
<p>106
00:22:13.410 --&gt; 00:22:38.970
Haki Benita: right? It consumes very little memory disk space and CPU, right? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Right now, this background job consumes a lot of CPU so much CPU, in fact, that it starts to interfere with user traffic in the system.</p>
<p>107
00:22:39.030 --&gt; 00:23:07.120
Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?</p>
<p>108
00:23:07.670 --&gt; 00:23:23.450
Haki Benita: So remember that we slapped A B tree on it right? And it was very, very fast, but I'm not sure that was like the most optimal thing that we could done. We could have done. So. Let's go to the database and see</p>
<p>109
00:23:23.580 --&gt; 00:23:33.769
Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.</p>
<p>110
00:23:34.570 --&gt; 00:23:41.979
Haki Benita: right is 7 MB. Okay, so that's pretty big for for this type of index.</p>
<p>111
00:23:42.120 --&gt; 00:23:47.420
Haki Benita: So our 7 MB index includes</p>
<p>112
00:23:47.630 --&gt; 00:23:57.789
Haki Benita: all the rows in the table. Right? We just added db_index=True to create a B-tree index on the column. So it contains all the 1 million rows in the table. But</p>
<p>113
00:23:58.570 --&gt; 00:24:05.790
Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many</p>
<p>114
00:24:06.150 --&gt; 00:24:25.690
Haki Benita: short URLs you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short URLs we have with exactly 0 hits.</p>
<p>115
00:24:26.100 --&gt; 00:24:37.350
Haki Benita: So what if, instead of indexing all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about.</p>
<p>116
00:24:37.810 --&gt; 00:24:51.950
Haki Benita: Right? So this is, once again, a mind-blowing idea, and it is made possible with something called partial indexes. Partial indexes allow us to index just the part of the table that we actually care about.</p>
<p>117
00:24:52.810 --&gt; 00:25:08.019
Haki Benita: So going back to our Django model, right. First we start by removing the db_index from the column definition (you should never use db_index, regardless of this), and then, instead of adding this default index on the column,</p>
<p>118
00:25:08.020 --&gt; 00:25:28.989
Haki Benita: we add a proper index. Right? But we add a condition. Okay, so what this does, it creates an index on the id column with a condition where hits equals 0. This would cause Postgres to create an index just on the rows that satisfy this query. Just on rows</p>
<p>119
00:25:29.200 --&gt; 00:25:54.569
Haki Benita: where hits equal 0. Right? So we generate the migration. We apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. Right? We see an index scan using short_url_unused_part_ix. This is the index we just created. Okay, so Postgres is able to use the index we just created, the partial index</p>
<p>120
00:25:55.000 --&gt; 00:26:04.670
Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?</p>
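The partial-index behavior can be checked from the standard library too: SQLite accepts the same CREATE INDEX ... WHERE syntax that Django's Index(condition=...) generates for Postgres. Table and index names here are made up for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT, hits INTEGER)"
)
# Partial index: only rows with hits = 0 are indexed, keeping it tiny.
con.execute("CREATE INDEX short_url_unused_ix ON short_url (id) WHERE hits = 0")

def plan_for(sql):
    # Collapse the EXPLAIN QUERY PLAN rows into one string for inspection.
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)

# A query whose WHERE clause matches the index predicate can use the index...
print(plan_for("SELECT id FROM short_url WHERE hits = 0"))
# ...while a different predicate cannot, and falls back to a full table scan.
print(plan_for("SELECT id FROM short_url WHERE hits = 5"))
```

This also previews the exact-predicate limitation the talk mentions: like Postgres, SQLite only considers the partial index when the query's condition implies the index's WHERE clause.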
<p>121
00:26:05.090 --&gt; 00:26:13.180
Haki Benita: But that wasn't the motivation here, right? This is not what we look to optimize. If we go back</p>
<p>122
00:26:13.320 --&gt; 00:26:28.990
Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.</p>
<p>123
00:26:28.990 --&gt; 00:26:48.659
Haki Benita: So I did the math seriously. I opened Excel. I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, Come on, man, it's just 7 MB. Who cares? But if you go back to your system, and you have huge tables with hundreds of millions and billions of rows. Right?</p>
<p>124
00:26:48.840 --&gt; 00:27:06.290
Haki Benita: Check the size of your B-tree indexes. They can become huge. I've seen situations where the B-tree index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.</p>
<p>125
00:27:07.020 --&gt; 00:27:21.090
Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that</p>
<p>126
00:27:22.030 --&gt; 00:27:26.349
Haki Benita: the database can only use partial indexes when</p>
<p>127
00:27:26.500 --&gt; 00:27:52.249
Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to do something like, where hits equal one minus one. Okay, it's limited to this level. So it's limited to queries that use the exact same condition. Usually it's fine, because, you know, why would you do hits equal one minus one.</p>
<p>128
00:27:52.380 --&gt; 00:27:53.080
Haki Benita: I don't know.</p>
<p>129
00:27:53.520 --&gt; 00:27:58.490
Haki Benita: I personally found that nullable columns are great candidates</p>
<p>130
00:27:58.780 --&gt; 00:28:09.290
Haki Benita: for partial indexes, because in Postgres, for example, null values are indexed, and usually you don't want to use an index for IS NULL queries. So I found that</p>
<p>131
00:28:09.480 --&gt; 00:28:34.749
Haki Benita: whenever I have a nullable column with an index on it, I can benefit from making it a partial index. In fact, I wrote an entire article on how we saved 20 GB of unused disk space simply by identifying nullable columns with indexes and switching them to use partial indexes. Okay, so questions about partial indexes before we move on to the next use case.</p>
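<p>That nullable-column pattern can be sketched the same way (SQLite stand-in again; the mostly-NULL <code>canonical_url</code> column is a hypothetical example, not from the talk):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, canonical_url TEXT)"
)
# Mostly-NULL column: only a few rows carry a value.
conn.executemany(
    "INSERT INTO short_url (canonical_url) VALUES (?)",
    [("https://example.com/a",), (None,), (None,), (None,),
     ("https://example.com/b",)],
)

# Skip the NULL rows entirely; the index covers only the populated values.
conn.execute(
    "CREATE INDEX short_url_canonical_part_ix ON short_url (canonical_url) "
    "WHERE canonical_url IS NOT NULL"
)

# An equality match implies IS NOT NULL, so the partial index is usable.
rows = conn.execute(
    "SELECT id FROM short_url WHERE canonical_url = 'https://example.com/a'"
).fetchall()
print(rows)  # [(1,)]
```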
<p>132
00:28:36.780 --&gt; 00:28:38.540
Haki Benita: Gabor, you have a question.</p>
<p>133
00:28:42.160 --&gt; 00:28:42.730
Haki Benita: No.</p>
<p>134
00:28:42.730 --&gt; 00:28:45.249
Gabor Szabo: Sorry there is this sorry? Actually, there is this question.</p>
<p>135
00:28:46.340 --&gt; 00:29:04.110
Haki Benita: Oh, is it a good idea to recalculate the hits and partial indexes? How frequently? Well, the nice thing about indexes and B-trees in general is that they are always in sync with the data in the table. It's actually part of the transaction. So when you, for example, increment.</p>
<p>136
00:29:05.180 --&gt; 00:29:07.990
Haki Benita: when you increment the counter for the 1st time</p>
<p>137
00:29:08.290 --&gt; 00:29:11.070
Haki Benita: the row would just disappear from the index.</p>
<p>138
00:29:11.250 --&gt; 00:29:26.029
Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.</p>
<p>139
00:29:26.460 --&gt; 00:29:32.839
Haki Benita: It's truly an amazing feature. You should definitely use it. Any more questions before we move on to</p>
<p>140
00:29:33.140 --&gt; 00:29:36.009
Haki Benita: a very exotic type of index in postgres.</p>
<p>141
00:29:36.750 --&gt; 00:29:38.110
Haki Benita: Ow.</p>
<p>142
00:29:41.210 --&gt; 00:29:46.360
Haki Benita: okay, great. So let's talk about another type. Another use case.</p>
<p>143
00:29:47.270 --&gt; 00:30:00.790
Haki Benita: So, in the 1st use case, we wanted to resolve the key to a URL, right? This is the redirect action. This time we want to do a reverse lookup. We want to ask</p>
<p>144
00:30:01.000 --&gt; 00:30:09.090
Haki Benita: how many keys we have pointing to this specific URL. So we wanna search for keys by the URL.</p>
<p>145
00:30:09.530 --&gt; 00:30:20.539
Haki Benita: So we implement this very simple function called reverse lookup. It accepts a URL and returns a queryset of short URLs. Okay?</p>
<p>146
00:30:21.210 --&gt; 00:30:49.150
Haki Benita: So if we want to see what the query looks like. We use dot query. And we can see select star from short URL where URL equals something. Okay, if we produce an execution plan. We can see that the database is doing a sequential scan on the short URL table that is, scanning the entire table, sifting row by row, finding matches for our query.</p>
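<p>Reading plans like this can be practiced without Postgres. A sketch using SQLite's EXPLAIN QUERY PLAN (a stand-in for Postgres EXPLAIN, with illustrative data) shows the same scan-versus-index distinction:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO short_url (key, url) VALUES (?, ?)",
    [(f"k{i}", f"https://example.com/page/{i}") for i in range(1000)],
)

def plan_for(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the access path.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM short_url WHERE url = 'https://example.com/page/7'"

# Without an index the whole table is scanned, row by row.
before = plan_for(query)
print(before)

# After adding an index, the plan switches to an index search.
conn.execute("CREATE INDEX short_url_url_ix ON short_url (url)")
after = plan_for(query)
print(after)
```

<p>The "SCAN" versus "SEARCH ... USING INDEX" wording is SQLite's; Postgres prints "Seq Scan" and "Index Scan", but the diagnosis is the same.</p>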
<p>147
00:30:49.430 --&gt; 00:30:50.800
Haki Benita: Whoa!</p>
<p>148
00:30:51.590 --&gt; 00:30:55.929
Haki Benita: And we can see that it's relatively</p>
<p>149
00:30:56.140 --&gt; 00:31:00.379
Haki Benita: slow, right? It's like 105</p>
<p>150
00:31:00.500 --&gt; 00:31:03.990
Haki Benita: milliseconds so compared to the index</p>
<p>151
00:31:04.320 --&gt; 00:31:08.840
Haki Benita: queries that we saw before. That's that's pretty slow. Right?</p>
<p>152
00:31:09.220 --&gt; 00:31:23.659
Haki Benita: So you know, once again, 99% of the people would just say, Come on, man, I'm hungry. Let's order some food. Just slap a B-tree on it. So this is what we do, right? We start by adding a B-tree on the URL</p>
<p>153
00:31:23.860 --&gt; 00:31:37.679
Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using</p>
<p>154
00:31:38.030 --&gt; 00:31:57.059
Haki Benita: the index on the URL column, and also it's very fast. Previously it was like a 100 ms. Now it's 0 point 1 ms. So that's a very, very big and significant improvement. We can all go to lunch and be very, very happy and satisfied with ourselves. But</p>
<p>155
00:31:57.770 --&gt; 00:32:09.459
Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.</p>
<p>156
00:32:09.830 --&gt; 00:32:16.550
Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.</p>
<p>157
00:32:16.740 --&gt; 00:32:22.669
Haki Benita: Okay, stay with me. 47 MB. If you remember the previous</p>
<p>158
00:32:23.050 --&gt; 00:32:28.779
Haki Benita: use case, we had an index on all the heads. It was 7 MB. I told you it was large.</p>
<p>159
00:32:28.950 --&gt; 00:32:44.159
Haki Benita: This index on the same amount of rows is 47 MB. That's very, very big, and the reason that it's very, very big is that the URL is very, very big, right? The B-tree index</p>
<p>160
00:32:44.390 --&gt; 00:32:49.879
Haki Benita: holds the actual values in the leaf block. So if we are indexing.</p>
<p>161
00:32:50.020 --&gt; 00:32:58.219
Haki Benita: a column with very large values like URLs, it can be very, very big. So if we are indexing</p>
<p>162
00:32:58.430 --&gt; 00:33:03.490
Haki Benita: a column with very, very big values, these values are also present in the index.</p>
<p>163
00:33:04.000 --&gt; 00:33:14.130
Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text, URLs.</p>
<p>164
00:33:14.410 --&gt; 00:33:18.940
Haki Benita: and that's 47 MB. Okay, so</p>
<p>165
00:33:19.430 --&gt; 00:33:28.389
Haki Benita: let's pause for a second. Okay, I know that B-tree is like the magic for 90% of the use cases. But there are other types of indexes that we can use.</p>
<p>166
00:33:28.955 --&gt; 00:33:32.949
Haki Benita: So let's pause for a second and ask ourselves, what do we know about.</p>
<p>167
00:33:33.210 --&gt; 00:33:48.990
Haki Benita: what do we know about the URL? Okay? So 1st of all, we know that URL is not unique. Right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short URLs</p>
<p>168
00:33:49.100 --&gt; 00:33:55.800
Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.</p>
<p>169
00:33:55.930 --&gt; 00:33:57.940
Haki Benita: However, however.</p>
<p>170
00:33:59.780 --&gt; 00:34:06.770
Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long URLs, right?</p>
<p>171
00:34:06.970 --&gt; 00:34:07.889
Haki Benita: like.</p>
<p>172
00:34:09.444 --&gt; 00:34:18.389
Haki Benita: It's not likely that people will use the shortener a lot to point to the same URL. Like, at the very least,</p>
<p>173
00:34:18.650 --&gt; 00:34:22.639
Haki Benita: they would have different UTM parameters for the same URL.</p>
<p>174
00:34:22.780 --&gt; 00:34:33.040
Haki Benita: So while it's not a restriction, you can have many keys pointing to the same URL, it's not likely, so we don't have a lot of duplicate values.</p>
<p>175
00:34:34.199 --&gt; 00:34:36.219
Haki Benita: So now I want to introduce you</p>
<p>176
00:34:36.710 --&gt; 00:35:00.369
Haki Benita: to what I call the Ugly Duckling of index types in Postgres, the hash index. Okay? And to understand how a hash index works and why it's different from a B-tree index, let's start by actually building a hash index ourselves. So imagine we have these values, A, B, C and D, and we want to index them using a hash index.</p>
<p>177
00:35:00.730 --&gt; 00:35:20.800
Haki Benita: So we start by applying a hash function on each value. So Postgres, in our example, has different hash functions for different types. So you can see that we have hash functions for text, char, arrays, even JSON types, timestamps, and so on.</p>
<p>178
00:35:20.930 --&gt; 00:35:34.680
Haki Benita: In our case we have just one character. So it uses hashchar. If we actually apply this function on the values we get the hash values. The next step is we want to divide these</p>
<p>179
00:35:34.870 --&gt; 00:35:36.829
Haki Benita: values into buckets.</p>
<p>180
00:35:37.030 --&gt; 00:35:43.100
Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modulo 2 on</p>
<p>181
00:35:44.050 --&gt; 00:36:04.600
Haki Benita: on the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket one and BC and D goes to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0, each hash value points to</p>
<p>182
00:36:04.860 --&gt; 00:36:10.809
Haki Benita: somewhere in the table. Okay, just like we had the TIDs in the B-tree. We have</p>
<p>183
00:36:10.980 --&gt; 00:36:32.230
Haki Benita: the TIDs right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the other way around, right? So if you want to search for the value B, for example, we apply a hash function on it. We get the hash value. We apply modulo number of buckets to get the</p>
<p>184
00:36:32.360 --&gt; 00:36:54.430
Haki Benita: bucket. In this case 0, and then we go to bucket 0 and we start scanning the pointers to find a matching hash. Once we found a matching hash, we can take this TID, which is a pointer to a place in the table, and we can go scan this row and look for matching rows. Okay, so this is how a hash index works in Postgres.</p>
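<p>The bucket walk-through above can be written out as a toy Python version. Python's built-in <code>hash()</code> stands in for Postgres functions like <code>hashchar</code>, and plain row numbers stand in for the TIDs:</p>

```python
# Toy hash "index": hash each value, take modulo the bucket count, and store
# (hash, row pointer) pairs in the bucket -- never the values themselves.
NBUCKETS = 2

def build_hash_index(rows):
    buckets = {b: [] for b in range(NBUCKETS)}
    for rownum, value in enumerate(rows):
        h = hash(value)
        buckets[h % NBUCKETS].append((h, rownum))
    return buckets

def lookup(buckets, rows, value):
    h = hash(value)
    # Scan only the one bucket this value can land in, then recheck the
    # actual row, because two different values may share a hash.
    return [rownum for hh, rownum in buckets[h % NBUCKETS]
            if hh == h and rows[rownum] == value]

rows = ["A", "B", "C", "D"]
index = build_hash_index(rows)
print(lookup(index, rows, "B"))  # [1]
```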
<p>185
00:36:55.190 --&gt; 00:37:14.639
Haki Benita: Now, if we want to create a hash index in Django, we need to use the special HashIndex from Postgres contrib. Okay? The reason for that is that hash index is not the default index type in Postgres. So we need to explicitly say, we want a hash index. Okay.</p>
<p>186
00:37:15.260 --&gt; 00:37:19.239
Haki Benita: so in this case we are creating a hash index on the URL field.</p>
<p>187
00:37:19.770 --&gt; 00:37:46.360
Haki Benita: and the name of this index is going to be short_url_hix. I like to use a suffix that indicates the type of the index, so when I look at execution plans, I can quickly identify the type of the index. So I usually use ix for B-tree indexes, and then I use part_ix for partial, hix for hash indexes, and so on. You can come up with whatever convention you want.</p>
<p>188
00:37:47.920 --&gt; 00:37:48.900
Haki Benita: So</p>
<p>189
00:37:49.530 --&gt; 00:38:00.809
Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.</p>
<p>190
00:38:00.940 --&gt; 00:38:01.990
Haki Benita: okay.</p>
<p>191
00:38:02.180 --&gt; 00:38:18.460
Haki Benita: 1st observation. This is very, very fast. Okay, so you can see that 0 point 0 7 ms. That's very, very fast. But that's not all. If we look at the size of our hash index. Compared</p>
<p>192
00:38:18.730 --&gt; 00:38:34.859
Haki Benita: to the Beecher index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old casio. And I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay.</p>
<p>193
00:38:35.340 --&gt; 00:38:37.929
Haki Benita: if we put all the data in a table.</p>
<p>194
00:38:38.180 --&gt; 00:38:46.570
Haki Benita: You can see that the hash index in this case, with both faster and smaller.</p>
<p>195
00:38:46.860 --&gt; 00:38:47.990
Haki Benita: So that's</p>
<p>196
00:38:48.170 --&gt; 00:39:06.030
Haki Benita: a win-win all around. Okay, faster and smaller than the default. B, 3. Index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a btree index on the key and on the URL. Okay, you can see the the chart right here.</p>
<p>197
00:39:06.490 --&gt; 00:39:35.660
Haki Benita: I have a hash index on the key. I have a hash index on the URL, I have a B-tree on the key, and I have a B-tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis. That's the number of rows. So I started adding rows into the table until I got to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of all the indexes, and then I put</p>
<p>198
00:39:35.740 --&gt; 00:39:39.649
Haki Benita: all the data in this chart, and we can see some</p>
<p>199
00:39:39.740 --&gt; 00:39:43.597
Haki Benita: interesting things. Okay. 1st of all.</p>
<p>200
00:39:44.510 --&gt; 00:39:46.580
Haki Benita: 1st of all, if you look at the.</p>
<p>201
00:39:47.000 --&gt; 00:39:49.219
Haki Benita: If you look at the red line.</p>
<p>202
00:39:49.470 --&gt; 00:39:52.999
Haki Benita: which is the B tree on the URL big piece of text.</p>
<p>203
00:39:53.690 --&gt; 00:40:18.479
Haki Benita: and the green line which is the B tree on the key the short piece of text. 1st of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see like this linear line increasing right? As I add, more rows, the size of the index increases. We can also see that the red line, the B tree on the URL is always larger.</p>
<p>204
00:40:18.850 --&gt; 00:40:21.239
Haki Benita: the the B tree on the key right?</p>
<p>205
00:40:21.780 --&gt; 00:40:30.559
Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us</p>
<p>206
00:40:30.890 --&gt; 00:40:33.730
Haki Benita: that the size of the bee tree is</p>
<p>207
00:40:33.840 --&gt; 00:40:36.900
Haki Benita: very much affected by the size</p>
<p>208
00:40:37.250 --&gt; 00:40:40.240
Haki Benita: of the column that it indexes.</p>
<p>209
00:40:40.380 --&gt; 00:40:49.959
Haki Benita: So A B tree on URL will be bigger than A B tree on key for the same amount of rows, because a URL is bigger than a key.</p>
<p>210
00:40:50.270 --&gt; 00:40:56.780
Haki Benita: So that's about the B 2 indexes. However, if we look at the hash indexes. That's the blue.</p>
<p>211
00:40:57.900 --&gt; 00:40:59.700
Haki Benita: the yellow lines.</p>
<p>212
00:41:00.190 --&gt; 00:41:02.260
Haki Benita: 1st of all, we can see that</p>
<p>213
00:41:03.480 --&gt; 00:41:10.410
Haki Benita: the size of the hash index, if I add more rows is not affected by the size of the value.</p>
<p>214
00:41:10.540 --&gt; 00:41:18.259
Haki Benita: because URL is big key small. But as I add more rows to the table. The size of the hash index is the same. Okay.</p>
<p>215
00:41:18.400 --&gt; 00:41:27.409
Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently lower, smaller.</p>
<p>216
00:41:27.690 --&gt; 00:41:35.050
Haki Benita: Then the same index, the same B, 3 index on the same column. Okay. So in this case the hash index was always smaller.</p>
<p>217
00:41:35.520 --&gt; 00:41:40.680
Haki Benita: Another thing that we can see in this chart that, unlike the B 3 index that grows linearly.</p>
<p>218
00:41:41.050 --&gt; 00:41:48.299
Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.</p>
<p>219
00:41:48.700 --&gt; 00:42:09.099
Haki Benita: So what's happening in a hash index is once we have, we start adding rows to the hash index, and then we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, postgres, needs to split this bucket. Now, when the bucket is split, postgres, pre allocates</p>
<p>220
00:42:09.580 --&gt; 00:42:12.570
Haki Benita: storage disk space for this bucket.</p>
<p>221
00:42:12.700 --&gt; 00:42:16.419
Haki Benita: So the steps that you see is the bucket splits</p>
<p>222
00:42:16.540 --&gt; 00:42:21.430
Haki Benita: where postgres allocates additional storage to split the bucket.</p>
<p>223
00:42:21.770 --&gt; 00:42:22.630
Haki Benita: Right?</p>
<p>224
00:42:22.970 --&gt; 00:42:25.229
Haki Benita: So this is why hash index</p>
<p>225
00:42:25.420 --&gt; 00:42:28.239
Haki Benita: grows in in, in, in steps.</p>
<p>226
00:42:29.060 --&gt; 00:42:35.259
Haki Benita: So hash index is ideal. When we have very few duplicates</p>
<p>227
00:42:35.470 --&gt; 00:42:59.300
Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.</p>
<p>228
00:42:59.680 --&gt; 00:43:18.200
Haki Benita: We also saw that, unlike a B tree index, hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values. It includes hash values. Okay, this is why I can index very, very big values, big strings</p>
<p>229
00:43:18.540 --&gt; 00:43:40.110
Haki Benita: with a relatively small index. Okay, as we saw hash index under some circumstances, can be both smaller and faster than A. B 3 index, and the reason that a lot of people are unfamiliar with a hash index is that prior to Postgres 10, which is already pretty old because we're now at Postgres 17.</p>
<p>230
00:43:40.580 --&gt; 00:44:04.829
Haki Benita: If you went to the documentation for Hash Index, there would be like this huge warning, saying, Beware, do not use hash indexes. They are not production ready. So a lot of developers became used to not using hash indexes, but starting in postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.</p>
<p>231
00:44:06.160 --&gt; 00:44:12.890
Haki Benita: We're talking about hash indexes. It is very important to also know the restrictions of hash indexes. 1st of all, hash index</p>
<p>232
00:44:14.290 --&gt; 00:44:32.920
Haki Benita: cannot be used to create. You can create a unique hash index, and the reason that you can is that a hash index does not contain the actual values, just hash values. And technically, you can have multiple different values producing the exact same hash value.</p>
<p>233
00:44:33.090 --&gt; 00:44:43.399
Haki Benita: So it can. You can create a unique hash index. However, okay, and that's the comment at the bottom, we can talk about it later. If you want. You can enforce unique</p>
<p>234
00:44:43.680 --&gt; 00:44:47.209
Haki Benita: with the hash index using an exclusion constraint.</p>
<p>235
00:44:47.440 --&gt; 00:44:56.589
Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?</p>
<p>236
00:44:57.410 --&gt; 00:45:02.989
Haki Benita: And we can use hash index for sorting and range searches, because once again.</p>
<p>237
00:45:03.280 --&gt; 00:45:10.940
Haki Benita: hash index does not contain the actual values. Just the hash values right? So I can't use a hash index for things like.</p>
<p>238
00:45:11.390 --&gt; 00:45:17.379
Haki Benita: you know, between greater than less than and so on. Just equality.</p>
<p>239
00:45:18.540 --&gt; 00:45:24.421
Haki Benita: So quick. Recap just 4 more slides. I promise. Okay,</p>
<p>240
00:45:26.090 --&gt; 00:45:34.610
Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.</p>
<p>241
00:45:34.650 --&gt; 00:45:56.340
Haki Benita: using an index, made the query faster. However, the not free, they come at a cost. You need to maintain this index, and this index maintenance happens when you insert when you update and when you delete. So the more indexes you create, the faster your queries are. But the slower every other operation is</p>
<p>242
00:45:56.500 --&gt; 00:46:18.380
Haki Benita: okay. Another thing to consider, and this is often overlooked. Indexes can be very, very big. They consume a lot of disk space when you go back to your databases. After this talk, please go do slash di plus, and look at the sizes of your index. I think that if you never looked at the size of your indexes.</p>
<p>243
00:46:18.620 --&gt; 00:46:23.349
Haki Benita: You're going to be very much surprised at what you're going to find.</p>
<p>244
00:46:24.180 --&gt; 00:46:41.909
Haki Benita: and finally using an index is not always best. If you have a query that needs to access a large portion of the table. Sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know.</p>
<p>245
00:46:42.190 --&gt; 00:46:43.480
Haki Benita: keep that in mind.</p>
<p>246
00:46:44.710 --&gt; 00:46:55.220
Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive between indexes, and we talked about hash index.</p>
<p>247
00:46:55.420 --&gt; 00:47:07.439
Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout throughout this presentation we went through the same process over and over again. We start by</p>
<p>248
00:47:07.600 --&gt; 00:47:25.639
Haki Benita: executing some query with, explain, analyze, to get the timing with no indexes. This is basically establishing a baseline right? And then we start by experimenting with different types of indexes. So usually, we start with a B tree. We take a measure of the time using, explain, analyze.</p>
<p>249
00:47:25.640 --&gt; 00:47:40.620
Haki Benita: and then we take the size of the index. We put it all in a nice table. We start experimenting. And once you have all the data organized like that. It's a lot easier to reach a decision on what is the best indexing approach</p>
<p>250
00:47:40.630 --&gt; 00:47:42.499
Haki Benita: for your specific use case.</p>
<p>251
00:47:42.560 --&gt; 00:47:53.119
Haki Benita: And also and hopefully, you remember that indexes performance is not just about speed. As we saw, we can get significant</p>
<p>252
00:47:53.660 --&gt; 00:47:57.540
Haki Benita: disk space reductions with a very, very.</p>
<p>253
00:47:57.600 --&gt; 00:48:09.329
Haki Benita: with a very small price of speed sometimes makes sense to make this compromise. We also, throughout this talk, saw how to use, explain</p>
<p>254
00:48:09.360 --&gt; 00:48:31.259
Haki Benita: how to use, explain, analyze how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully, when you go back to your system. You start doing, explain, analyze some of the queries you run a lot. You get to actually understand what the database is doing. Now</p>
<p>255
00:48:31.560 --&gt; 00:48:45.659
Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash index, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have</p>
<p>256
00:48:46.330 --&gt; 00:48:56.900
Haki Benita: Brent indexes. We have function based indexes, and we have a lot of different flavors of things that we can do. And you can check out this</p>
<p>257
00:48:57.300 --&gt; 00:49:04.960
Haki Benita: class 3 h packed with astral magic for your benefit and</p>
<p>258
00:49:05.810 --&gt; 00:49:13.720
Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.</p>
<p>259
00:49:19.490 --&gt; 00:49:22.113
Gabor Szabo: Whoa, thank you.</p>
<p>260
00:49:23.750 --&gt; 00:49:26.585
Gabor Szabo: Because, yeah.</p>
<p>261
00:49:27.400 --&gt; 00:49:28.630
Haki Benita: Hectic.</p>
<p>262
00:49:30.335 --&gt; 00:49:35.410
Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.</p>
<p>263
00:49:35.520 --&gt; 00:49:42.589
Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.</p>
<p>264
00:49:42.760 --&gt; 00:49:47.639
Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.</p>
<p>265
00:49:47.910 --&gt; 00:49:48.340
Gabor Szabo: Yeah.</p>
<p>266
00:49:48.340 --&gt; 00:49:53.060
Haki Benita: Yeah, I managed to catch this trend very, very early on.</p>
<p>267
00:49:54.515 --&gt; 00:49:55.540
Gabor Szabo: Okay.</p>
<p>268
00:49:55.790 --&gt; 00:49:56.270
Haki Benita: Mom.</p>
<p>269
00:49:56.270 --&gt; 00:50:01.189
Gabor Szabo: Comments, questions before we. We close this session.</p>
<p>270
00:50:02.340 --&gt; 00:50:05.000
Gabor Szabo: We know where to find you.</p>
<p>271
00:50:05.160 --&gt; 00:50:07.829
Gabor Szabo: We'll have the link.</p>
<p>272
00:50:08.320 --&gt; 00:50:16.320
Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily.</p>
<p>273
00:50:17.100 --&gt; 00:50:19.660
Gabor Szabo: and any comments.</p>
<p>274
00:50:19.660 --&gt; 00:50:20.020
Haki Benita: Okay.</p>
<p>275
00:50:20.020 --&gt; 00:50:21.859
Gabor Szabo: Questions, apparently not.</p>
<p>276
00:50:21.860 --&gt; 00:50:24.780
Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.</p>
<p>277
00:50:24.780 --&gt; 00:50:25.650
Gabor Szabo: It was excellent.</p>
<p>278
00:50:26.146 --&gt; 00:50:27.139
Haki Benita: Meet up!</p>
<p>279
00:50:27.140 --&gt; 00:50:32.660
Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.</p>
<p>280
00:50:32.770 --&gt; 00:50:41.470
Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on, on this slide, and then we'll put it under the video.</p>
<p>281
00:50:42.020 --&gt; 00:50:52.750
Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much for giving the presentation. Please like the video and follow the channel. Yeah.</p>
<p>282
00:50:53.020 --&gt; 00:51:10.139
Gabor Szabo: And if you would like to give a presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and</p>
<p>283
00:51:10.430 --&gt; 00:51:15.029
Gabor Szabo: see you at the next meeting next video, whatever.</p>
<p>284
00:51:15.400 --&gt; 00:51:16.869
Gabor Szabo: Thank you. Bye, bye.</p>
<p>285
00:51:16.870 --&gt; 00:51:18.830
Haki Benita: Thank you very much. Everyone. Good night.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>How to Make Your Backend Roar with Haki Benita</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-02-04T08:30:01Z</updated>
    <pubDate>2025-02-04T08:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/how-to-make-your-backend-roar" />
    <id>https://python.code-maven.com/how-to-make-your-backend-roar</id>
<content type="html"><![CDATA[<p>Developers who are not familiar with databases often dread them and treat them like black boxes, but fear no more! In this talk I present advanced indexing techniques to make your database faster and more efficient.</p>
<p>Indexes are extremely powerful and ORMs like Django and SQLAlchemy provide many ways of harnessing their powers to make queries faster and the database more efficient. In this talk I reveal the secrets of DBAs with some advanced indexing techniques such as partial, function based and inclusive B-Tree indexes, and who knows, maybe even some index types you never heard of before!</p>
<p><img src="images/haki-benita.jpeg" alt="Haki Benita" /></p>
<p><a href="https://www.linkedin.com/in/haki-benita-95327952/">Haki Benita</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/vNjpZsBZqM0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:00.720 --&gt; 00:00:02.690
Haki Benita: This meeting is being recorded.</p>
<p>2
00:00:03.400 --&gt; 00:00:04.320
Gabor Szabo: Okay.</p>
<p>3
00:00:05.800 --&gt; 00:00:12.250
Gabor Szabo: yeah. So Hi, and welcome to the python Maven, let's call it Python Maven. This is the code Maven</p>
<p>4
00:00:12.500 --&gt; 00:00:41.910
Gabor Szabo: Youtube channel. And we are organizing these meetings in the Codebay Events group, but sort of it has 3 separate sessions, and this is going to be the the Python specific one. My name is Gabor Sabo. I usually teach python and rust and help companies introduce testing, and I also like to organize these events and allow people to share their knowledge with with each other.</p>
<p>5
00:00:42.270 --&gt; 00:00:46.010
Gabor Szabo: You're welcome. I'm really happy that you're here</p>
<p>6
00:00:46.140 --&gt; 00:01:04.909
Gabor Szabo: in this session, listening, as I mentioned earlier, you're welcome to to comment or use the chat and ask questions. And if you're just watching the video recorded on Youtube, then please remember to like the video and follow the channel.</p>
<p>7
00:01:05.080 --&gt; 00:01:11.990
Gabor Szabo: and let's welcome hockey now, and let him introduce you. Introduce yourself and and</p>
<p>8
00:01:12.700 --&gt; 00:01:17.579
Gabor Szabo: and give the presentation. So thank you for accepting the invitation.</p>
<p>9
00:01:18.970 --&gt; 00:01:31.149
Haki Benita: Thank you. Thank you, Gabo. 1st of all, I like the fact that we have this intimate group that we can freely talk. I actually encourage you to consider opening the mics.</p>
<p>10
00:01:31.210 --&gt; 00:02:01.090
Haki Benita: Because I think we can actually have a conversation throughout the presentation. I like to give interactive presentation. Your call. You're the boss, and just a quick introduction about the subject and about myself. So we are going to talk about how to make your back end war. And I want to start by apologizing for the tacky headline. But unfortunately, these types of tacky headlines do work. Believe it or not.</p>
<p>11
00:02:01.610 --&gt; 00:02:09.010
Haki Benita: So. My name is Haki Benita. I'm a software developer and a technical lead. I'm currently leading a team</p>
<p>12
00:02:09.289 --&gt; 00:02:18.949
Haki Benita: of developers working on a very large ticketing platform in Israel, serving about</p>
<p>13
00:02:19.580 --&gt; 00:02:32.470
Haki Benita: 1.5 million unique paying users every month. And I also like to write and talk about Python performance and databases. And you can find my stuff on my website.</p>
<p>14
00:02:33.110 --&gt; 00:02:47.839
Haki Benita: So today, we are going to talk about some lesser known features of indexes. And we're going to try and understand how they work and when we can and should use them</p>
<p>15
00:02:47.850 --&gt; 00:03:14.629
Haki Benita: to do that, we are going to build a URL shortener together, and we're going to do it in Django. I would say that since this is a talk about python, I'm going to use Django and the Django Orm. But the concepts that I'm going to describe are not specific to Django, and they're not specific to Postgres. Heck. They're not even specific to python. But this is a good environment to explain the concepts with.</p>
<p>16
00:03:15.390 --&gt; 00:03:19.889
Haki Benita: So what is a URL shortener? You probably know about</p>
<p>17
00:03:19.900 --&gt; 00:03:39.330
Haki Benita: other types of URL shorteners? You have Bitly, you have the late goo.gl, buff.ly, and so on. Basically, a URL shortener is a system that provides a short URL that redirects to a longer URL. Now, why would you want to do that?</p>
<p>18
00:03:39.330 --&gt; 00:04:02.240
Haki Benita: First, if you are operating in text-constrained environments like SMS messages or tweets, you might want to share a very long link. So you want to make it shorter, so it consumes less space. This is where short URLs can be handy. Another nice feature of URL shortening is that whenever someone clicks the short URL,</p>
<p>19
00:04:02.240 --&gt; 00:04:16.500
Haki Benita: the URL shortener redirects to the long URL and keeps track of how many people click that link. So if you have something like a campaign that you want to launch, you want to keep track of how many people clicked your link.</p>
<p>20
00:04:16.820 --&gt; 00:04:20.149
Haki Benita: This is what you would use a URL shortener for</p>
<p>21
00:04:20.310 --&gt; 00:04:48.240
Haki Benita: so to build our URL shortener in Django, we're going to start with this very, very simple model. We are calling the model short URL, we have an Id column which is the primary key. It's just an auto incrementing integer field. We have the key. That's a unique short piece of text that uniquely identifies our short URL. This is the short key at the end of the short URL.</p>
<p>22
00:04:48.500 --&gt; 00:05:07.030
Haki Benita: We then have the URL, which is the long URL we want to redirect to. We also want to keep track of when the URL was created; we do that using the created_at column. And finally, we want to keep track of how many users click the link, and we do that with the hits column</p>
<p>23
00:05:07.180 --&gt; 00:05:08.110
Haki Benita: at the bottom.</p>
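<p>A sketch of the model described above, reconstructed from the narration; the field types, the max_length, and the class name are assumptions, and the id column is Django's implicit auto-incrementing primary key:</p>

```python
from django.db import models

class ShortUrl(models.Model):
    # Sketch reconstructed from the talk's description; details are assumed.
    key = models.CharField(max_length=16, unique=True)    # short unique key
    url = models.URLField()                               # long URL to redirect to
    created_at = models.DateTimeField(auto_now_add=True)  # when it was created
    hits = models.IntegerField(default=0)                 # click counter
```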
<p>24
00:05:08.960 --&gt; 00:05:19.650
Haki Benita: So for our demonstration, so we actually have something to work with, I loaded 1 million short URLs into the table. Okay, now, this is not a lot. But we are going to see, some</p>
<p>25
00:05:20.700 --&gt; 00:05:25.929
Haki Benita: performance gains with just 1 million rows. Okay.</p>
<p>26
00:05:26.810 --&gt; 00:05:33.380
Haki Benita: so this talk is about Python. But it's essentially about SQL, so</p>
<p>27
00:05:33.510 --&gt; 00:05:54.859
Haki Benita: in Django, if you want to get the SQL generated by Django for a given queryset, you can do that by accessing queryset.query and printing it. In this case I'm doing ShortURL filter on a specific key, dot query. And I can actually get Django to print</p>
<p>28
00:05:55.190 --&gt; 00:05:59.549
Haki Benita: the SQL that it generated for this queryset, right?</p>
<p>29
00:06:00.040 --&gt; 00:06:26.740
Haki Benita: So, after viewing the queryset, it's also very interesting to see how the database is planning to execute my query. Right? So I can do that by executing the function explain(). This translates into an EXPLAIN command in SQL. And what I get in return is not the results of the query, but the execution plan, which is how the database is planning</p>
<p>30
00:06:26.930 --&gt; 00:06:30.979
Haki Benita: to execute my query. Now, when we just use explain,</p>
<p>31
00:06:31.200 --&gt; 00:06:36.260
Haki Benita: the database doesn't actually execute the query. It just produces a plan</p>
<p>32
00:06:36.370 --&gt; 00:06:53.839
Haki Benita: sometimes, especially when we're benchmarking and we're trying to improve performance, it can be useful to produce the execution plan, but also have the database execute the query and return some useful execution data. For that we can use a slightly different variation of the explain command,</p>
<p>33
00:06:53.970 --&gt; 00:07:13.319
Haki Benita: which is EXPLAIN ANALYZE. In Django, you can do that by using explain(analyze=True). In SQL, Postgres specifically, you can do EXPLAIN (ANALYZE, TIMING ON) in parentheses, followed by the query, and then you get some additional information about the execution plan.</p>
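<p>The Django and Postgres forms described here can't run stand-alone, but the same "ask the planner first" idea can be sketched with Python's built-in sqlite3, whose EXPLAIN QUERY PLAN plays the role Postgres's EXPLAIN does (the Django and Postgres forms appear as comments; the short_url table is an assumption mirroring the talk's model):</p>

```python
import sqlite3

# Django (sketch):   ShortUrl.objects.filter(key="abc").explain(analyze=True)
# Postgres (sketch): EXPLAIN (ANALYZE, TIMING ON) SELECT ...;
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT UNIQUE, url TEXT)"
)

# SQLite's closest analogue: ask the planner how it would run the query,
# without actually executing it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = ?", ("abc",)
).fetchall()
for row in plan:
    print(row[-1])  # human-readable plan step, e.g. which index is searched
```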
<p>34
00:07:13.350 --&gt; 00:07:27.339
Haki Benita: First, because the database actually executed the query, you can see at the bottom that we get how long it took the database to produce an execution plan, in this case that would be 0.140 ms,</p>
<p>35
00:07:27.710 --&gt; 00:07:38.510
Haki Benita: and I also get how long it took the database to execute the query from start to end. In this case that would be 0.046. Okay.</p>
<p>36
00:07:39.430 --&gt; 00:07:47.120
Haki Benita: Now, in addition to the timing, I'm also getting a very, very interesting piece of information inside the execution plan.</p>
<p>37
00:07:47.260 --&gt; 00:07:53.699
Haki Benita: Okay, what I get is the estimated cost and the actual cost</p>
<p>38
00:07:53.820 --&gt; 00:07:58.059
Haki Benita: that the database encountered while executing the query. So</p>
<p>39
00:07:59.010 --&gt; 00:08:15.400
Haki Benita: discussing the cost-based optimizer is slightly outside the scope of this talk, I would just say that, comparing the expected cost to the actual cost is a very useful measure to try and identify bad execution plans.</p>
<p>40
00:08:16.100 --&gt; 00:08:17.350
Haki Benita: Finally.</p>
<p>41
00:08:17.990 --&gt; 00:08:28.419
Haki Benita: another way of viewing queries is to turn on the logger for the database backend in Django. This way, whenever Django executes a query,</p>
<p>42
00:08:29.040 --&gt; 00:08:32.620
Haki Benita: it logs the SQL that was produced by the ORM.</p>
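<p>For reference, turning on that logger in a Django project is a settings-file fragment along these lines; this is a sketch of the documented django.db.backends logger, not something shown on the talk's slides:</p>

```python
# settings.py (sketch): log every SQL statement Django's ORM executes.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {
        # The ORM logs each query at DEBUG level under this logger name;
        # note Django only records queries when DEBUG = True.
        "django.db.backends": {"handlers": ["console"], "level": "DEBUG"},
    },
}
```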
<p>43
00:08:33.510 --&gt; 00:08:34.475
Haki Benita: So</p>
<p>44
00:08:35.700 --&gt; 00:09:05.329
Haki Benita: to actually start discussing some indexing techniques, we need to start implementing some, you know, business processes. So let's start with the most basic thing that a URL shortener actually does, and that's look up the URL to redirect to by a key. So a user uses one of our short URLs, we get the unique key, and we need to find the long URL to redirect to. Okay, this is like the bread and butter of this system.</p>
<p>45
00:09:05.440 --&gt; 00:09:27.109
Haki Benita: So if we want to implement this very, very simple function, we can do something like that. def resolve, okay, that's the name of the function; we want to resolve a key to a URL. We accept a key, and then we execute this simple query to just get a short URL for this key. If we don't find anything we return None. Otherwise we return the URL to redirect to</p>
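<p>A minimal stand-alone sketch of that resolve function, using sqlite3 instead of the Django ORM; the Django version would be roughly ShortUrl.objects.filter(key=key).first(), and the table and column names here are assumptions mirroring the talk's model:</p>

```python
import sqlite3

def resolve(conn, key):
    """Return the long URL for a short key, or None if the key is unknown."""
    row = conn.execute(
        "SELECT url FROM short_url WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

# Minimal fixture so the function can be exercised.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT UNIQUE, url TEXT)"
)
conn.execute(
    "INSERT INTO short_url (key, url) VALUES (?, ?)",
    ("abc123", "https://example.com/long"),
)
```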
<p>46
00:09:27.110 --&gt; 00:09:37.730
Haki Benita: okay. Now we want to look at the SQL that Django generated for this function. Right? So we execute this function on some random key</p>
<p>47
00:09:37.950 --&gt; 00:09:57.950
Haki Benita: with SQL logging turned on, and we can see the query right here. Now, if you look at this query, it looks like Django basically fetched everything from the short URL table for the key that we asked for, right? Select star from short URL where key equals something.</p>
<p>48
00:09:58.270 --&gt; 00:10:05.050
Haki Benita: If we want to look at how postgres is actually executing this query.</p>
<p>49
00:10:05.210 --&gt; 00:10:12.719
Haki Benita: we can use the explain command. And what we get is that Postgres is planning to use an index scan</p>
<p>50
00:10:13.535 --&gt; 00:10:20.159
Haki Benita: on the index we have on the key column. Okay, now.</p>
<p>51
00:10:21.180 --&gt; 00:10:28.839
Haki Benita: to understand what exactly an index scan means, let's take a second to talk about B-tree indexes.</p>
<p>52
00:10:29.040 --&gt; 00:10:42.120
Haki Benita: So the B-tree index is like the king of all indexes. This is the default index in most database engines. If you're not sure what type of index you're using, you're probably using a B-tree index. Okay?</p>
<p>53
00:10:42.560 --&gt; 00:11:11.160
Haki Benita: So to understand how a B-tree index works, let's start by building one. So imagine you have these values, one through 9, and you want to create a B-tree index on them. You start by sorting the values and storing them in leaf blocks. You can see the leaf blocks at the bottom. They are sorted from left to right. We have 1, 2, 3, all the way through 9. Now every entry in the leaf blocks contains a list of TIDs. These are pointers to rows in the table</p>
<p>54
00:11:11.400 --&gt; 00:11:15.460
Haki Benita: That store rows with these values. Okay.</p>
<p>55
00:11:16.290 --&gt; 00:11:28.179
Haki Benita: now, above the leaves, we have branches and root block that acts as a directory to these leaf blocks. So let's see how this works. Let's imagine that we want to look.</p>
<p>56
00:11:28.180 --&gt; 00:11:38.290
Gabor Szabo: Sorry, just someone says that they don't see the slides. So I just wanted to check, and I'm unsure if the other people do see the slides. So if</p>
<p>57
00:11:38.670 --&gt; 00:11:53.529
Gabor Szabo: I asked it in the chat, but no one answered. So I hope that other people... okay, so some other people see it. So my recommendation to Eduardo is to maybe exit Zoom and enter Zoom again. Sorry for the.</p>
<p>58
00:11:53.530 --&gt; 00:11:54.940
Haki Benita: Okay, no problem.</p>
<p>59
00:11:55.120 --&gt; 00:11:56.160
Haki Benita: Yeah.</p>
<p>60
00:11:56.400 --&gt; 00:11:59.700
Haki Benita: Okay, okay, so let's</p>
<p>61
00:12:01.690 --&gt; 00:12:31.100
Haki Benita: okay. So let's try to search for the value 5 in the B-tree index that we just built. So we start with the root block and we start scanning from left to right. So 5 is larger than 3, so we skip the first entry. 5 is between 3 and 7, so we follow this pointer to the middle leaf block. We then start scanning the leaf block from left to right. The first value is 4. It's not a match.</p>
<p>62
00:12:31.100 --&gt; 00:12:36.150
Haki Benita: The next value is 5. That's a match, and now we can</p>
<p>63
00:12:36.150 --&gt; 00:12:47.970
Haki Benita: scan. We can follow the pointers from this leaf block to the rows in the table. We can read the rows and do whatever we need to do with these rows. Okay, now.</p>
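<p>The root-to-leaf walk just described can be mimicked with a toy two-level structure; this is purely illustrative, since a real B-tree block also stores the TIDs that point at table rows:</p>

```python
import bisect

# Leaf blocks hold the sorted values 1..9; the root holds the boundary
# keys that direct a search to the right leaf.
leaves = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
root = [4, 7]  # first value of the second and third leaves

def btree_contains(value):
    # Use the root to pick the one leaf that could hold `value`,
    # then scan that leaf from left to right, exactly as in the talk.
    leaf = leaves[bisect.bisect_right(root, value)]
    return value in leaf
```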
<p>64
00:12:48.310 --&gt; 00:13:15.100
Haki Benita: let's go back to our query. Okay, one second, yeah. Let's go back to our query. Remember that we said that Django generated this query and this query is fetching everything right, basically select star from short URL. But, in fact, if you think about it, we don't actually care about all these fields right? We only care about the URL. I mean, we're not looking to resolve</p>
<p>65
00:13:16.290 --&gt; 00:13:27.129
Haki Benita: a key to a URL for the purpose of redirecting. I don't care when it was created. I don't care about the id. I already have the key, right? And I don't care about the hits counter at this point</p>
<p>66
00:13:27.610 --&gt; 00:13:30.209
Haki Benita: right? So I don't care about all these fields. So</p>
<p>67
00:13:30.770 --&gt; 00:13:55.089
Haki Benita: one thing that we can do is, instead of fetching all of these fields, how about if we just fetch what we actually need? Right? So in Django, we can do that by adding values_list('url'). Now the function is slightly different. But if we look at the SQL generated by this function, we can see that now, instead of fetching all the columns in the row, we just fetch the URL. So this is exactly what we need.</p>
<p>68
00:13:55.200 --&gt; 00:14:10.249
Haki Benita: If we look at the execution plan once again for this query, we can see that again Postgres is using an index scan on the unique index that we have on the key. Right? So now,</p>
<p>69
00:14:10.920 --&gt; 00:14:30.719
Haki Benita: once we found a matching row, we can follow the pointer to the table. We can get the URL from the table. So if you imagine the amount of disk reads I need to do to satisfy this query: I'm starting by reading the root block, right? So that's 1 read. Then I need to follow the branch all the way to the leaf. Let's say that we have just,</p>
<p>70
00:14:30.730 --&gt; 00:14:41.789
Haki Benita: you know, root block, and then directly to the leaf. So reading the leaf is another read, and then we need to follow the link from the leaf block to read the row from the table. So this is a unique</p>
<p>71
00:14:41.970 --&gt; 00:14:52.020
Haki Benita: column. So we have at most one row. So that's another read. So basically, we did 3 random reads to satisfy this query right now.</p>
<p>72
00:14:53.290 --&gt; 00:15:03.019
Haki Benita: this query is executed a lot. This is basically what our system is doing, right? It's getting keys and resolving them to URLs to redirect, right?</p>
<p>73
00:15:03.360 --&gt; 00:15:17.979
Haki Benita: now. We already established that all we care about in this specific scenario is just the URL. I don't care about anything else. I care just about the URL. So what if? And stay with me? This is mind blowing.</p>
<p>74
00:15:17.980 --&gt; 00:15:34.249
Haki Benita: What if, instead of going to the table to get the URL. What if I could include the URL in the leaf block in the index this way? When I found a matching entry in the leaf block, I would have the URL just sitting there.</p>
<p>75
00:15:34.310 --&gt; 00:15:52.420
Haki Benita: Right? So this mind-blowing idea is called an inclusive index. Okay, in other databases it's called a covering index, or inclusive indexes, and what it allows us to do is store additional information in the leaf block.</p>
<p>76
00:15:52.500 --&gt; 00:16:14.569
Haki Benita: So if we want to use an inclusive index in Django, we can add the include argument to the unique constraint. Now look, the key is indexed. The URL is not indexed. It's just included in the leaf block. Okay. Now, if we generate a migration, we apply it and we try the query again.</p>
<p>77
00:16:15.500 --&gt; 00:16:21.569
Haki Benita: You can see that once again, Postgres is using our index, our unique index on the key. But there is</p>
<p>78
00:16:21.900 --&gt; 00:16:33.889
Haki Benita: very, very subtle difference here. If you notice. Previously we had an index scan using our unique index. This time we have an index only scan.</p>
<p>79
00:16:34.020 --&gt; 00:17:03.620
Haki Benita: This means that Postgres was able to satisfy the query without accessing the table. All the data that it needs was already in the leaf block. So if we once again imagine how many reads we need to do to satisfy this query, using the inclusive index, we read the root block. We follow the pointer all the way down to the leaf block, and now, instead of going to the table to read the URL. We have the URL right there in the leaf block. So we only need to read</p>
<p>80
00:17:03.670 --&gt; 00:17:05.849
Haki Benita: 2 blocks from disk.</p>
<p>81
00:17:06.150 --&gt; 00:17:17.110
Haki Benita: Okay, the way to identify this is by the operator, the index-only scan, right? So we have an index scan, and we have an index-only scan.</p>
<p>82
00:17:18.170 --&gt; 00:17:39.170
Haki Benita: So quick recap about inclusive indexes, as I mentioned in other databases. They are sometimes called covering indexes, and they allow us to fulfill queries without accessing the table. However, you should use them with caution. Because if you think about it, we're basically duplicating data from the table to the index. Okay?</p>
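<p>The index-only scan can be reproduced outside Postgres too. SQLite has no INCLUDE clause, so the sketch below stands in for it with a composite index on (key, url); because the query only asks for the URL, the planner reports a covering scan and never touches the table. All names are assumptions mirroring the talk:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_url (id INTEGER PRIMARY KEY, key TEXT, url TEXT)"
)
# Composite index standing in for Postgres's
#   CREATE UNIQUE INDEX ... ON short_url (key) INCLUDE (url)
conn.execute("CREATE INDEX short_url_key_url_ix ON short_url (key, url)")
conn.execute(
    "INSERT INTO short_url (key, url) VALUES ('abc', 'https://example.com')"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM short_url WHERE key = ?", ("abc",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
# The plan should report a COVERING INDEX scan: the query is satisfied
# from the index alone, the analogue of Postgres's index-only scan.
```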
<p>83
00:17:39.170 --&gt; 00:17:49.959
Haki Benita: So if you have a very big piece of information, and a URL can be very, very big, so basically I'm now storing the URL</p>
<p>84
00:17:50.140 --&gt; 00:18:09.440
Haki Benita: twice, so the index could get very, very big. I'm actually not a big fan of inclusive indexes. But I can think of 2 scenarios where it might be a good idea. First, if you have very wide tables, imagine data-warehouse type of tables, denormalized tables,</p>
<p>85
00:18:09.600 --&gt; 00:18:11.520
Haki Benita: and you have a very</p>
<p>86
00:18:12.250 --&gt; 00:18:22.290
Haki Benita: predefined set of queries that are executed very, very often on a very, very small subset of columns, you can consider doing using</p>
<p>87
00:18:23.440 --&gt; 00:18:50.249
Haki Benita: an inclusive index. And also, I personally found that non unique composite indexes can be good candidates for inclusive indexes that is, indexes on multiple columns that are not used to enforce a unique constraint. Sometimes they can benefit from switching to just a composite index to an inclusive index. Okay, questions so far before we move on to the next use case.</p>
<p>88
00:18:55.710 --&gt; 00:19:02.210
Haki Benita: Okay, if you have any questions, feel free. Let's move on to the next use case.</p>
<p>89
00:19:02.800 --&gt; 00:19:04.080
Haki Benita: So now</p>
<p>90
00:19:04.230 --&gt; 00:19:16.229
Haki Benita: we want to find unused keys right? We have this business question. We want to know how many show through ours. We have with no hits at all. Okay, we have 0 hits.</p>
<p>91
00:19:17.070 --&gt; 00:19:23.050
Haki Benita: So we start by implementing this very, very simple function. We call it, find unusedindexes.</p>
<p>92
00:19:23.350 --&gt; 00:19:26.190
Haki Benita: and it returns a query set where</p>
<p>93
00:19:26.790 --&gt; 00:19:43.480
Haki Benita: with short Urls, where hits equals 0. Once again, if we want to see what the query looks like we can print the result of query. We can see that it returns like star from short URL, where hits equals 0.</p>
<p>94
00:19:44.560 --&gt; 00:19:58.929
Haki Benita: Once again, through the process, we produce an execution plan. This time. We can see that Postgres is doing a sequential scan on short URL. A sequential scan is basically a full table. Scan postgres is just</p>
<p>95
00:19:59.010 --&gt; 00:20:18.369
Haki Benita: reading the table row by row, looking for rows where hits equals 0. We can see that the execution time at the bottom is 116 ms. Let's say, for the sake of discussion, that this is very, very slow, and we want to try and improve that.</p>
<p>96
00:20:18.450 --&gt; 00:20:48.250
Haki Benita: So if you go to like 99% of developers at Dba, they will tell you what's the problem and just slap a B tree on it. Right. So we add a B tree index on the hits column. We do that in Django using dB index equals. True, we generate a migration. We apply the migration. We once again produce the execution plan with, analyze, and lo and behold.</p>
<p>97
00:20:48.310 --&gt; 00:20:56.180
Haki Benita: Postgres is using our index short. URL hits Ix. And, as you can see the execution. Time</p>
<p>98
00:20:56.810 --&gt; 00:21:02.370
Haki Benita: is very, very fast compared to before, so we're done right.</p>
<p>99
00:21:03.230 --&gt; 00:21:06.060
Haki Benita: We can call it the day we can go for lunch.</p>
<p>100
00:21:06.330 --&gt; 00:21:08.609
Haki Benita: We're happy. It's fast. Now</p>
<p>101
00:21:09.310 --&gt; 00:21:20.299
Haki Benita: stop, let's take a second to talk about performance and what it actually means. Okay? Because intuitively, when we talk about performance, we talk about</p>
<p>102
00:21:20.380 --&gt; 00:21:37.639
Haki Benita: speed right? We want things to be very, very quick. But I think, or the way I view performance is that we need to balance different types of resources. And I want to illustrate this with an example. Okay, let's say that you have this batch processing job running at night.</p>
<p>103
00:21:37.640 --&gt; 00:21:53.420
Haki Benita: Now, this batch processing job runs at the middle of the night, where you have very, very little users, and it runs very, very fast. It takes like this batch processing job like 10 seconds to complete. You're so happy, so fast. However.</p>
<p>104
00:21:53.720 --&gt; 00:22:05.569
Haki Benita: however, this job consumes huge amounts of memory, huge amounts of CPU and huge amounts of disk space right. What if I told you that</p>
<p>105
00:22:06.440 --&gt; 00:22:12.950
Haki Benita: if we are willing to compromise, and instead of completing in 10 seconds, it takes a minute</p>
<p>106
00:22:13.410 --&gt; 00:22:38.970
Haki Benita: right? It consumes very little memory disk space and CPU, right? I'm guessing that if you pay a lot of money for memory, you are willing to make this compromise. Okay, I'll give you another example. Let's say that you have this background job running in the middle of the day. Right now, this background job consumes a lot of CPU so much CPU, in fact, that it starts to interfere with user traffic in the system.</p>
<p>107
00:22:39.030 --&gt; 00:23:07.120
Haki Benita: In this case, instead of optimizing for time, you might be optimizing for CPU, right? You're willing to compromise a few seconds. But you don't want the background job to consume a lot of CPU. So when we talk about performance. We talk about more than just speed. We're talking about how we can balance different resources in the system, usually depending on some type of context time of day the type of resource that we have available at this time. Right?</p>
<p>108
00:23:07.670 --&gt; 00:23:23.450
Haki Benita: So remember that we slapped A B tree on it right? And it was very, very fast, but I'm not sure that was like the most optimal thing that we could done. We could have done. So. Let's go to the database and see</p>
<p>109
00:23:23.580 --&gt; 00:23:33.769
Haki Benita: and check the size of the index we created to solve this teeny, tiny problem. Okay, so this index.</p>
<p>110
00:23:34.570 --&gt; 00:23:41.979
Haki Benita: right is 7 MB. Okay, so that's pretty big for for this type of index.</p>
<p>111
00:23:42.120 --&gt; 00:23:47.420
Haki Benita: So our 7 MB index includes</p>
<p>112
00:23:47.630 --&gt; 00:23:57.789
Haki Benita: all the rows in the table. Right? We just add a dB index through create a B tree index on the column. So it contains all the 1 million rows in the table. But</p>
<p>113
00:23:58.570 --&gt; 00:24:05.790
Haki Benita: we actually don't care about all the rows in the table. Right? Nobody asked us how many</p>
<p>114
00:24:06.150 --&gt; 00:24:25.690
Haki Benita: short Urls you have with less than 5 hits, or more than 266 hits, or exactly 1,000 hits. Nobody cares about that. We had a very specific question that we wanted to answer in regards to the hits. We wanted to find how many short Urls we have with exactly 0 hits.</p>
<p>115
00:24:26.100 --&gt; 00:24:37.350
Haki Benita: So what if, instead of indexing the all the rows in the table, we could index just a portion of the rows, the part of the table that we actually care about.</p>
<p>116
00:24:37.810 --&gt; 00:24:51.950
Haki Benita: Right? So this is a once again mind-blowing idea, and this is made possible with something called partial indexes. Partial indexes, allows us to index just a part of the table that we actually care about.</p>
<p>117
00:24:52.810 --&gt; 00:25:08.019
Haki Benita: So going back to our Django model right. 1st we start by removing the dB index from the column definition, you should never use dB index. Regardless of this, and then, instead of adding this default index. On the column.</p>
<p>118
00:25:08.020 --&gt; 00:25:28.989
Haki Benita: we add a proper index. Right? But we add a condition. Okay, so what this does, it creates an index on the Id column with a condition where hits equals 0. This would cause postgres to create an index just on the rows that satisfy this query. Just on rows</p>
<p>119
00:25:29.200 --&gt; 00:25:54.569
Haki Benita: where hits equal 0. Right? So we generate the migration. We apply the migration, and we try the query again. We produce an execution plan, and we can see that Postgres is using our index. Right? We see an index scan using short URL unused part Ix. This is the index we just created. Okay, so Postgres is able to use the index we just created the partial index</p>
<p>120
00:25:55.000 --&gt; 00:26:04.670
Haki Benita: to satisfy this very specific query. We can also see that the query is very, very fast, even compared to the full index. Right?</p>
<p>121
00:26:05.090 --&gt; 00:26:13.180
Haki Benita: But that wasn't the motivation here, right? This is not what we look to optimize. If we go back</p>
<p>122
00:26:13.320 --&gt; 00:26:28.990
Haki Benita: to the database, and we look at the size of this index. Look at that. The partial index is just 88 kB in size. Okay? Previously the full index was 7 MB. The partial index is 88 kB.</p>
<p>123
00:26:28.990 --&gt; 00:26:48.659
Haki Benita: So I did the math seriously. I opened excel. I did the math. That's 99% smaller. Okay, so that's a lot of space. Now, at this point you're probably saying, Come on, man, it's just 7 MB. Who cares? But if you go back to your system, and you have huge tables with hundreds and billions of rows. Right?</p>
<p>124
00:26:48.840 --&gt; 00:27:06.290
Haki Benita: Check the size of your B 3 indexes. They can become huge. I've seen situations where the B 3 index was larger than the table. Okay, and if you have a lot of indexes it can grow out of control very, very quickly.</p>
<p>125
00:27:07.020 --&gt; 00:27:21.090
Haki Benita: So, as you may guess, I'm a very, very big fan of partial indexes. They produce smaller indexes, and I highly encourage you to use them whenever possible. One limitation of partial indexes is that</p>
<p>126
00:27:22.030 --&gt; 00:27:26.349
Haki Benita: the database can only use partial indexes when</p>
<p>127
00:27:26.500 --&gt; 00:27:52.249
Haki Benita: the query uses the exact same condition as the predicate in the index. Right? The database is not even smart enough to do something like like, where hits equal one minus one. Okay to this level. Okay, so it's limited to queries that use the exact same condition. Usually it's fine, because you know, why would you do hits equal one minus one.</p>
<p>128
00:27:52.380 --&gt; 00:27:53.080
Haki Benita: I don't know.</p>
<p>129
00:27:53.520 --&gt; 00:27:58.490
Haki Benita: I personally found that noble columns are great candidates</p>
<p>130
00:27:58.780 --&gt; 00:28:09.290
Haki Benita: for partial indexes, because in postgres, for example, null values are indexed, and usually you don't want to use an index for is null queries. So I found that</p>
<p>131
00:28:09.480 --&gt; 00:28:34.749
Haki Benita: whenever I have a noble column with an index on it, I can benefit from making it a partial indexes. In fact, I wrote an entire article on how we save 20 GB of unused disk space simply by identifying noble columns with indexes and switching them to use partial index. Okay, so questions about partial indexes before we move on to the next use case.</p>
<p>132
00:28:36.780 --&gt; 00:28:38.540
Haki Benita: Gabor, you have a question.</p>
<p>133
00:28:42.160 --&gt; 00:28:42.730
Haki Benita: No.</p>
<p>134
00:28:42.730 --&gt; 00:28:45.249
Gabor Szabo: Sorry, there is this... sorry? Actually, there is this question.</p>
<p>135
00:28:46.340 --&gt; 00:29:04.110
Haki Benita: Oh, is it a good idea to recalculate the hits and partial indexes? How frequently! Well, the nice thing about indexes and btrees in general that they are always in sync with the data in the table, it's actually part of the transaction. So when you, for example, increment.</p>
<p>136
00:29:05.180 --&gt; 00:29:07.990
Haki Benita: when you increment the counter for the 1st time</p>
<p>137
00:29:08.290 --&gt; 00:29:11.070
Haki Benita: the row would just disappear from the index.</p>
<p>138
00:29:11.250 --&gt; 00:29:26.029
Haki Benita: Right? So I'm guessing that you're asking, because you have some experience with like materialized views and stuff like that. So you don't actually have to maintain it actively. It's just maintained by the database.</p>
<p>139
00:29:26.460 --&gt; 00:29:32.839
Haki Benita: It's truly an amazing feature. You should definitely use that any more questions before we move on to</p>
<p>140
00:29:33.140 --&gt; 00:29:36.009
Haki Benita: a very exotic type of index in postgres.</p>
<p>141
00:29:36.750 --&gt; 00:29:38.110
Haki Benita: Ow.</p>
<p>142
00:29:41.210 --&gt; 00:29:46.360
Haki Benita: okay, great. So let's talk about another type. Another use case.</p>
<p>143
00:29:47.270 --&gt; 00:30:00.790
Haki Benita: So first, st in the 1st use case, we wanted to resolve the key to a URL right? This is the redirect action. This time we want to do a reverse. Look up. We want to ask</p>
<p>144
00:30:01.000 --&gt; 00:30:09.090
Haki Benita: how many keys we have pointing to this specific URL. So we wanna search for keys by the URL.</p>
<p>145
00:30:09.530 --&gt; 00:30:20.539
Haki Benita: So we implement this very simple function called reverse lookup. It accepts a URL and returns query, set of short Urls. Okay?</p>
<p>146
00:30:21.210 --&gt; 00:30:49.150
Haki Benita: So if we want to see what the query looks like. We use dot query. And we can see select star from short URL where URL equals something. Okay, if we produce an execution plan. We can see that the database is doing a sequential scan on the short URL table that is, scanning the entire table, sifting row by row, finding matches for our query.</p>
<p>147
00:30:49.430 --&gt; 00:30:50.800
Haki Benita: Whoa!</p>
<p>148
00:30:51.590 --&gt; 00:30:55.929
Haki Benita: And we can see that it's relatively</p>
<p>149
00:30:56.140 --&gt; 00:31:00.379
Haki Benita: slow, right? It's like 105</p>
<p>150
00:31:00.500 --&gt; 00:31:03.990
Haki Benita: milliseconds so compared to the index</p>
<p>151
00:31:04.320 --&gt; 00:31:08.840
Haki Benita: queries that we saw before. That's that's pretty slow. Right?</p>
<p>152
00:31:09.220 --&gt; 00:31:23.659
Haki Benita: So you know, once again, 99% of the people would just say, Come on, man, I'm hungry. Let's order some food. Just slop a B tree on it. So this is what we do right? We start by adding A B tree on the URL</p>
<p>153
00:31:23.860 --&gt; 00:31:37.679
Haki Benita: right? We generate and apply the migration. Now we execute the exact same query again, and we can see that now Postgres is using the index that we just created. We can see an index scan using</p>
<p>154
00:31:38.030 --&gt; 00:31:57.059
Haki Benita: the index on the URL column, and also it's very fast. Previously it was like a 100 ms. Now it's 0 point 1 ms. So that's a very, very big and significant improvement. We can all seek to launch and be very, very happy and satisfied with ourselves. But</p>
<p>155
00:31:57.770 --&gt; 00:32:09.459
Haki Benita: are we done? Do you think that we are done? Is there anything that we can optimize? Now, if you are paying attention throughout this presentation. You know that we can definitely do better than that.</p>
<p>156
00:32:09.830 --&gt; 00:32:16.550
Haki Benita: Let's go to the database and check the size of the index. Okay? So the size of the index.</p>
<p>157
00:32:16.740 --&gt; 00:32:22.669
Haki Benita: Okay, stay with me. 47 MB. If you remember the previous</p>
<p>158
00:32:23.050 --&gt; 00:32:28.779
Haki Benita: use case, we had an index on all the heads. It was 7 MB. I told you it was large.</p>
<p>159
00:32:28.950 --&gt; 00:32:44.159
Haki Benita: This index on the same amount of rows is 47 MB. That's very, very big, and the reason that it's very, very big is that the URL is very, very big, right? The B-tree index</p>
<p>160
00:32:44.390 --&gt; 00:32:49.879
Haki Benita: holds the actual values in the leaf block. So if we are indexing.</p>
<p>161
00:32:50.020 --&gt; 00:32:58.219
Haki Benita: A column with very large values like Urls, can be very, very big. So if we are indexing</p>
<p>162
00:32:58.430 --&gt; 00:33:03.490
Haki Benita: a column with very, very big values, these values are also present in the index.</p>
<p>163
00:33:04.000 --&gt; 00:33:14.130
Haki Benita: and the index can get very, very big. So previously, when we were indexing integers, it was 7 MB. Now we're indexing large pieces of text Urls.</p>
<p>164
00:33:14.410 --&gt; 00:33:18.940
Haki Benita: and that's 47 MB. Okay, so</p>
<p>165
00:33:19.430 --&gt; 00:33:28.389
Haki Benita: let's pause for a second. Okay, I know that btree is like the magic for 90% of the use cases. But there are other types of indexes that we can use.</p>
<p>166
00:33:28.955 --&gt; 00:33:32.949
Haki Benita: So let's pause for a second and ask ourselves, what do we know about.</p>
<p>167
00:33:33.210 --&gt; 00:33:48.990
Haki Benita: what do we know about the URL? Okay? So 1st of all, we know that URL is not unique. Right? We can have multiple keys pointing to the same URL. We can have, for example, different campaigns with different short Urls</p>
<p>168
00:33:49.100 --&gt; 00:33:55.800
Haki Benita: pointing to the same URL. There's no restriction in the system. You can have many keys pointing to the same URL. So it's not unique.</p>
<p>169
00:33:55.930 --&gt; 00:33:57.940
Haki Benita: However, however.</p>
<p>170
00:33:59.780 --&gt; 00:34:06.770
Haki Benita: if we actually look at the data, we see that we don't have a lot of duplicate long Urls right</p>
<p>171
00:34:06.970 --&gt; 00:34:07.889
Haki Benita: like.</p>
<p>172
00:34:09.444 --&gt; 00:34:18.389
Haki Benita: It's not likely that people will create a lot of short URLs pointing to the same URL. At the very least,</p>
<p>173
00:34:18.650 --&gt; 00:34:22.639
Haki Benita: they would have different utm parameters for the same. URL.</p>
<p>174
00:34:22.780 --&gt; 00:34:33.040
Haki Benita: So while it's not a restriction, you can have many keys pointing to the same URL, it's not likely, so we don't have a lot of duplicate values.</p>
<p>175
00:34:34.199 --&gt; 00:34:36.219
Haki Benita: So now I want to introduce you</p>
<p>176
00:34:36.710 --&gt; 00:35:00.369
Haki Benita: to what I call the Ugly Duckling of index types in postgres, the Hash Index. Okay? And to understand how a hash index works and why it's different from B 3 index. Let's start by actually building a hash index ourselves. So imagine we have these values, A, BC and D, and we want to index them using a hash index.</p>
<p>177
00:35:00.730 --&gt; 00:35:20.800
Haki Benita: So we start by applying a hash function on each value. Postgres has different hash functions for different types. So you can see that we have hashes for text, char, arrays, even JSON types, timestamps, and so on.</p>
<p>178
00:35:20.930 --&gt; 00:35:34.680
Haki Benita: In our case we have just one character. So it uses hashchar. If we actually apply this function on the values we get the hash values. The next step is we want to divide these</p>
<p>179
00:35:34.870 --&gt; 00:35:36.829
Haki Benita: values into buckets.</p>
<p>180
00:35:37.030 --&gt; 00:35:43.100
Haki Benita: So we start by dividing them into 2 buckets. Basically, we apply modular 2 on</p>
<p>181
00:35:44.050 --&gt; 00:36:04.600
Haki Benita: on the hash value, and then we assign each value to a bucket. So we can see that A goes to bucket one and BC and D goes to bucket 0. So this is our hash index. Okay, so we have 3 hash values in bucket 0, each hash value points to</p>
<p>182
00:36:04.860 --&gt; 00:36:10.809
Haki Benita: somewhere in the table. Okay, just like we had the Tids in the B tree. We have</p>
<p>183
00:36:10.980 --&gt; 00:36:32.230
Haki Benita: the tids right here in the hash index. Now, if we want to use this hash index to find some value, we do the exact same thing, but the opposite, but the other way around, right? So if you want to search for the value B, for example, we apply a hash function on it. We get the hash value. We apply modulus number of buckets to get the</p>
<p>184
00:36:32.360 --&gt; 00:36:54.430
Haki Benita: bucket. In this case 0, and then we go to bucket 0 and we start scanning the pointers to find matching hash. Once we found a matching hash, we can take this 2, which is a pointer to a place in the table, and we can go scan this row and look for matching rows. Okay, so this is how a hash index works in postgres.</p>
<p>185
00:36:55.190 --&gt; 00:37:14.639
Haki Benita: Now, if we want to create a hash index in Django, we need to use the special <code>HashIndex</code> from <code>django.contrib.postgres</code>. Okay? The reason for that is that hash index is not the default index type in postgres. So we need to explicitly say we want a hash index. Okay.</p>
<p>186
00:37:15.260 --&gt; 00:37:19.239
Haki Benita: so in this case we are creating a hash index on the URL field.</p>
<p>187
00:37:19.770 --&gt; 00:37:46.360
Haki Benita: and the name of this index is going to be short. URL Hix. I like to use a suffix that indicates the type of the index. So when you know, when I look at execution planes, I can quickly identify the type of the index. So I usually use Ix for B. 3 indexes, and then I use part Ix. For partial hix for hash indexes, and so on. You can come up with whatever convention you want.</p>
<p>188
00:37:47.920 --&gt; 00:37:48.900
Haki Benita: So</p>
<p>189
00:37:49.530 --&gt; 00:38:00.809
Haki Benita: we generate the migration, we apply the migration and produce an execution plan. And we can see that Postgres is using our hash index. Okay? Now.</p>
<p>190
00:38:00.940 --&gt; 00:38:01.990
Haki Benita: okay.</p>
<p>191
00:38:02.180 --&gt; 00:38:18.460
Haki Benita: 1st observation. This is very, very fast. Okay, so you can see that 0 point 0 7 ms. That's very, very fast. But that's not all. If we look at the size of our hash index. Compared</p>
<p>192
00:38:18.730 --&gt; 00:38:34.859
Haki Benita: to the Beecher index, we can see that the hash index is 30% smaller. Okay, trust me, I took a calculator, an old casio. And I calculated the difference. It's 30% smaller. Okay, that's very, very significant. Okay.</p>
<p>193
00:38:35.340 --&gt; 00:38:37.929
Haki Benita: if we put all the data in a table.</p>
<p>194
00:38:38.180 --&gt; 00:38:46.570
Haki Benita: You can see that the hash index in this case, with both faster and smaller.</p>
<p>195
00:38:46.860 --&gt; 00:38:47.990
Haki Benita: So that's</p>
<p>196
00:38:48.170 --&gt; 00:39:06.030
Haki Benita: a win-win all around. Okay, faster and smaller than the default. B, 3. Index. Now, I did a little experiment. Okay. So what I did was, I created a hash index and a btree index on the key and on the URL. Okay, you can see the the chart right here.</p>
<p>197
00:39:06.490 --&gt; 00:39:35.660
Haki Benita: I have a hash index on the key. I have a hash index on the URL, I have a B tree on the key, and I have a B tree on the URL. And what I did is I started adding rows to the table. Okay, you can see at the bottom the bottom axis. That's the number of rows. So I started adding rows into the table until I get to a million rows. Now, every time I added rows to the table I took a snapshot of the sizes of the hash index of all the indexes, and then I put this</p>
<p>198
00:39:35.740 --&gt; 00:39:39.649
Haki Benita: all the data in this chart, and we can see some</p>
<p>199
00:39:39.740 --&gt; 00:39:43.597
Haki Benita: interesting things. Okay. 1st of all.</p>
<p>200
00:39:44.510 --&gt; 00:39:46.580
Haki Benita: 1st of all, if you look at the.</p>
<p>201
00:39:47.000 --&gt; 00:39:49.219
Haki Benita: If you look at the red line.</p>
<p>202
00:39:49.470 --&gt; 00:39:52.999
Haki Benita: which is the B tree on the URL big piece of text.</p>
<p>203
00:39:53.690 --&gt; 00:40:18.479
Haki Benita: and the green line which is the B tree on the key the short piece of text. 1st of all, you can see that both of them grow basically linearly as I add more rows to the table, right? So we can see like this linear line increasing right? As I add, more rows, the size of the index increases. We can also see that the red line, the B tree on the URL is always larger.</p>
<p>204
00:40:18.850 --&gt; 00:40:21.239
Haki Benita: the the B tree on the key right?</p>
<p>205
00:40:21.780 --&gt; 00:40:30.559
Haki Benita: So the reason for that is that the URL is a big piece of text, and the key is a short piece of text. This tells us</p>
<p>206
00:40:30.890 --&gt; 00:40:33.730
Haki Benita: that the size of the bee tree is</p>
<p>207
00:40:33.840 --&gt; 00:40:36.900
Haki Benita: very much affected by the size</p>
<p>208
00:40:37.250 --&gt; 00:40:40.240
Haki Benita: of the column that it indexes.</p>
<p>209
00:40:40.380 --&gt; 00:40:49.959
Haki Benita: So a B-tree on URL will be bigger than a B-tree on key for the same amount of rows, because a URL is bigger than a key.</p>
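<p>A back-of-the-envelope sketch of why the B-tree's size tracks the column while a hash index's does not: a B-tree leaf stores each value itself, while a hash index stores a fixed-size hash code per row (4 bytes in Postgres). The URLs and keys below are made up, and real index sizes include page headers and other overhead, so this is only a rough illustration.</p>

```python
# Rough payload comparison (not real Postgres accounting).
urls = [f"https://example.com/some/long/path?utm_source=newsletter&id={i}"
        for i in range(1000)]
keys = [f"k{i:05d}" for i in range(1000)]  # short keys, like short-URL slugs

btree_on_url = sum(len(u.encode()) for u in urls)  # grows with value size
btree_on_key = sum(len(k.encode()) for k in keys)  # smaller values, smaller index
hash_on_url = 4 * len(urls)                        # fixed-size hash code per row

print(btree_on_url > btree_on_key > hash_on_url)  # -> True
```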
<p>210
00:40:50.270 --&gt; 00:40:56.780
Haki Benita: So that's about the B 2 indexes. However, if we look at the hash indexes. That's the blue.</p>
<p>211
00:40:57.900 --&gt; 00:40:59.700
Haki Benita: the yellow lines.</p>
<p>212
00:41:00.190 --&gt; 00:41:02.260
Haki Benita: 1st of all, we can see that</p>
<p>213
00:41:03.480 --&gt; 00:41:10.410
Haki Benita: the size of the hash index, if I add more rows is not affected by the size of the value.</p>
<p>214
00:41:10.540 --&gt; 00:41:18.259
Haki Benita: because URL is big key small. But as I add more rows to the table. The size of the hash index is the same. Okay.</p>
<p>215
00:41:18.400 --&gt; 00:41:27.409
Haki Benita: The second thing that I can see is that in this specific case the hash index was consistently lower, smaller.</p>
<p>216
00:41:27.690 --&gt; 00:41:35.050
Haki Benita: Then the same index, the same B, 3 index on the same column. Okay. So in this case the hash index was always smaller.</p>
<p>217
00:41:35.520 --&gt; 00:41:40.680
Haki Benita: Another thing that we can see in this chart that, unlike the B 3 index that grows linearly.</p>
<p>218
00:41:41.050 --&gt; 00:41:48.299
Haki Benita: the hash index grows in like steps. Right? You can see the step, and then it's flat. Step flat.</p>
<p>219
00:41:48.700 --&gt; 00:42:09.099
Haki Benita: So what's happening in a hash index is, we start adding rows to the hash index, and then we have some bucket, and this bucket starts to fill up. Now, when a bucket fills up, Postgres needs to split this bucket. Now, when the bucket is split, Postgres pre-allocates</p>
<p>220
00:42:09.580 --&gt; 00:42:12.570
Haki Benita: storage disk space for this bucket.</p>
<p>221
00:42:12.700 --&gt; 00:42:16.419
Haki Benita: So the steps that you see is the bucket splits</p>
<p>222
00:42:16.540 --&gt; 00:42:21.430
Haki Benita: where postgres allocates additional storage to split the bucket.</p>
<p>223
00:42:21.770 --&gt; 00:42:22.630
Haki Benita: Right?</p>
<p>224
00:42:22.970 --&gt; 00:42:25.229
Haki Benita: So this is why hash index</p>
<p>225
00:42:25.420 --&gt; 00:42:28.239
Haki Benita: grows in steps.</p>
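<p>The step-wise growth can be mimicked with a toy model in which the bucket count doubles whenever the average bucket would overflow. Postgres actually splits buckets one at a time and the real thresholds differ; this only illustrates why the allocated size jumps rather than growing smoothly.</p>

```python
def build_hash_index(n_rows, entries_per_bucket=4, initial_buckets=2):
    """Toy model of step-wise growth: double the bucket count whenever the
    average bucket would overflow. (Postgres splits buckets more gradually,
    but the allocated size still jumps rather than growing smoothly.)"""
    buckets = initial_buckets
    sizes = []
    for row in range(1, n_rows + 1):
        if row > buckets * entries_per_bucket:
            buckets *= 2  # "split": pre-allocate space for the new buckets
        sizes.append(buckets)
    return sizes

sizes = build_hash_index(40)
print(sorted(set(sizes)))  # -> [2, 4, 8, 16]
```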
<p>226
00:42:29.060 --&gt; 00:42:35.259
Haki Benita: So hash index is ideal. When we have very few duplicates</p>
<p>227
00:42:35.470 --&gt; 00:42:59.300
Haki Benita: in the rows that we want to index, and the reason for that is, if we have lots of duplicates, the values would map to the same bucket, and we won't get the benefit of a hash index. The reason that a hash index made sense in our case is that URL is mostly unique. It's almost unique. Okay, it's not unique by definition. But there's not a lot of duplicates.</p>
<p>228
00:42:59.680 --&gt; 00:43:18.200
Haki Benita: We also saw that, unlike a B tree index, hash index is not affected by the size of the values that it indexes, and the reason for that is that the hash index doesn't actually include the values. It includes hash values. Okay, this is why I can index very, very big values, big strings</p>
<p>229
00:43:18.540 --&gt; 00:43:40.110
Haki Benita: with a relatively small index. Okay, as we saw hash index under some circumstances, can be both smaller and faster than A. B 3 index, and the reason that a lot of people are unfamiliar with a hash index is that prior to Postgres 10, which is already pretty old because we're now at Postgres 17.</p>
<p>230
00:43:40.580 --&gt; 00:44:04.829
Haki Benita: If you went to the documentation for Hash Index, there would be like this huge warning, saying, Beware, do not use hash indexes. They are not production ready. So a lot of developers became used to not using hash indexes, but starting in postgres 10, you can definitely use hash indexes in production. They are production ready, and as we saw, they can be very, very good under some circumstances.</p>
<p>231
00:44:06.160 --&gt; 00:44:12.890
Haki Benita: When we're talking about hash indexes, it is very important to also know the restrictions of hash indexes. 1st of all, hash index</p>
<p>232
00:44:14.290 --&gt; 00:44:32.920
Haki Benita: cannot be used to enforce uniqueness. You cannot create a unique hash index, and the reason is that a hash index does not contain the actual values, just hash values, and technically, you can have multiple different values producing the exact same hash value.</p>
<p>233
00:44:33.090 --&gt; 00:44:43.399
Haki Benita: So it can. You can create a unique hash index. However, okay, and that's the comment at the bottom, we can talk about it later. If you want. You can enforce unique</p>
<p>234
00:44:43.680 --&gt; 00:44:47.209
Haki Benita: with the hash index using an exclusion constraint.</p>
<p>235
00:44:47.440 --&gt; 00:44:56.589
Haki Benita: Okay, next, we can't have a composite hash index. We can't have a hash index on multiple columns. Okay?</p>
<p>236
00:44:57.410 --&gt; 00:45:02.989
Haki Benita: And we can't use a hash index for sorting and range searches, because, once again,</p>
<p>237
00:45:03.280 --&gt; 00:45:10.940
Haki Benita: hash index does not contain the actual values. Just the hash values right? So I can't use a hash index for things like.</p>
<p>238
00:45:11.390 --&gt; 00:45:17.379
Haki Benita: you know, between greater than less than and so on. Just equality.</p>
<p>239
00:45:18.540 --&gt; 00:45:24.421
Haki Benita: So quick. Recap just 4 more slides. I promise. Okay,</p>
<p>240
00:45:26.090 --&gt; 00:45:34.610
Haki Benita: when to use indexes. So remember, indexes can make queries faster. We saw that in all of our examples.</p>
<p>241
00:45:34.650 --&gt; 00:45:56.340
Haki Benita: using an index, made the query faster. However, the not free, they come at a cost. You need to maintain this index, and this index maintenance happens when you insert when you update and when you delete. So the more indexes you create, the faster your queries are. But the slower every other operation is</p>
<p>242
00:45:56.500 --&gt; 00:46:18.380
Haki Benita: okay. Another thing to consider, and this is often overlooked. Indexes can be very, very big. They consume a lot of disk space when you go back to your databases. After this talk, please go do slash di plus, and look at the sizes of your index. I think that if you never looked at the size of your indexes.</p>
<p>243
00:46:18.620 --&gt; 00:46:23.349
Haki Benita: You're going to be very much surprised at what you're going to find.</p>
<p>244
00:46:24.180 --&gt; 00:46:41.909
Haki Benita: and finally using an index is not always best. If you have a query that needs to access a large portion of the table. Sometimes it doesn't make sense to use an index for that. Okay, there's no magic number, but, you know.</p>
<p>245
00:46:42.190 --&gt; 00:46:43.480
Haki Benita: keep that in mind.</p>
<p>246
00:46:44.710 --&gt; 00:46:55.220
Haki Benita: So we talked about index types and features. We talked about partial indexes, inclusive indexes, and we talked about hash indexes.</p>
<p>247
00:46:55.420 --&gt; 00:47:07.439
Haki Benita: We talked a little bit about how to evaluate performance. I don't know if you noticed, but throughout throughout this presentation we went through the same process over and over again. We start by</p>
<p>248
00:47:07.600 --&gt; 00:47:25.639
Haki Benita: executing some query with, explain, analyze, to get the timing with no indexes. This is basically establishing a baseline right? And then we start by experimenting with different types of indexes. So usually, we start with a B tree. We take a measure of the time using, explain, analyze.</p>
<p>249
00:47:25.640 --&gt; 00:47:40.620
Haki Benita: and then we take the size of the index. We put it all in a nice table. We start experimenting. And once you have all the data organized like that. It's a lot easier to reach a decision on what is the best indexing approach</p>
<p>250
00:47:40.630 --&gt; 00:47:42.499
Haki Benita: for your specific use case.</p>
<p>251
00:47:42.560 --&gt; 00:47:53.119
Haki Benita: And also and hopefully, you remember that indexes performance is not just about speed. As we saw, we can get significant</p>
<p>252
00:47:53.660 --&gt; 00:47:57.540
Haki Benita: disk space reductions with a very, very.</p>
<p>253
00:47:57.600 --&gt; 00:48:09.329
Haki Benita: with a very small price of speed sometimes makes sense to make this compromise. We also, throughout this talk, saw how to use, explain</p>
<p>254
00:48:09.360 --&gt; 00:48:31.259
Haki Benita: how to use, explain, analyze how to debug SQL in Django, and we also saw a lot of execution plans. I don't know if you noticed, but if you've never seen execution plans before, hopefully, when you go back to your system. You start doing, explain, analyze some of the queries you run a lot. You get to actually understand what the database is doing. Now</p>
<p>255
00:48:31.560 --&gt; 00:48:45.659
Haki Benita: in this talk I talked only about inclusive indexes, partial indexes, and hash index, but, in fact, there are many, many different other types of indexes that are exotic and very, very cool. We have</p>
<p>256
00:48:46.330 --&gt; 00:48:56.900
Haki Benita: Brent indexes. We have function based indexes, and we have a lot of different flavors of things that we can do. And you can check out this</p>
<p>257
00:48:57.300 --&gt; 00:49:04.960
Haki Benita: class 3 h packed with astral magic for your benefit and</p>
<p>258
00:49:05.810 --&gt; 00:49:13.720
Haki Benita: finally check me out in all of these places, and I'm happy to take questions or discuss whatever you want.</p>
<p>259
00:49:19.490 --&gt; 00:49:22.113
Gabor Szabo: Whoa, thank you.</p>
<p>260
00:49:23.750 --&gt; 00:49:26.585
Gabor Szabo: Because, yeah.</p>
<p>261
00:49:27.400 --&gt; 00:49:28.630
Haki Benita: Hectic.</p>
<p>262
00:49:30.335 --&gt; 00:49:35.410
Gabor Szabo: Yeah, this is not a question. Haki's article on hash indexes is truly excellent.</p>
<p>263
00:49:35.520 --&gt; 00:49:42.589
Gabor Szabo: I believe it remains one of the top search results for anyone looking for resources on hash indexes.</p>
<p>264
00:49:42.760 --&gt; 00:49:47.639
Haki Benita: It's true, it's true. This is one of the top searches for hash index in postgres.</p>
<p>265
00:49:47.910 --&gt; 00:49:48.340
Gabor Szabo: Yeah.</p>
<p>266
00:49:48.340 --&gt; 00:49:53.060
Haki Benita: Yeah, I managed to catch this trend very, very early on.</p>
<p>267
00:49:54.515 --&gt; 00:49:55.540
Gabor Szabo: Okay.</p>
<p>268
00:49:55.790 --&gt; 00:49:56.270
Haki Benita: Mom.</p>
<p>269
00:49:56.270 --&gt; 00:50:01.189
Gabor Szabo: Comments, questions before we close this session.</p>
<p>270
00:50:02.340 --&gt; 00:50:05.000
Gabor Szabo: We know where to find you.</p>
<p>271
00:50:05.160 --&gt; 00:50:07.829
Gabor Szabo: We'll have the link.</p>
<p>272
00:50:08.320 --&gt; 00:50:16.320
Gabor Szabo: You can add the links to the post of the video as well, so people can find it easily.</p>
<p>273
00:50:17.100 --&gt; 00:50:19.660
Gabor Szabo: and any comments.</p>
<p>274
00:50:19.660 --&gt; 00:50:20.020
Haki Benita: Okay.</p>
<p>275
00:50:20.020 --&gt; 00:50:21.859
Gabor Szabo: Questions, apparently not.</p>
<p>276
00:50:21.860 --&gt; 00:50:24.780
Haki Benita: Yeah, I want to thank you, Gabor, for hosting this meeting.</p>
<p>277
00:50:24.780 --&gt; 00:50:25.650
Gabor Szabo: It was excellent.</p>
<p>278
00:50:26.146 --&gt; 00:50:27.139
Haki Benita: Meet up!</p>
<p>279
00:50:27.140 --&gt; 00:50:32.660
Gabor Szabo: Yeah. Well, yeah. So thank you very much for this presentation.</p>
<p>280
00:50:32.770 --&gt; 00:50:41.470
Gabor Szabo: If anyone has questions, then we'll see how to find Haki later on on this slide, and then we'll put it under the video.</p>
<p>281
00:50:42.020 --&gt; 00:50:52.750
Gabor Szabo: Thank you for supporting us. Thank you for being here. Thank you very much to you for giving the presentation. Please like the video and follow the channel. Yeah.</p>
<p>282
00:50:53.020 --&gt; 00:51:10.139
Gabor Szabo: And if you would like to give a presentation, you're welcome to contact me as well, and we'll see how we can schedule a presentation, at what time, and so on. So thank you very much, and</p>
<p>283
00:51:10.430 --&gt; 00:51:15.029
Gabor Szabo: see you at the next meeting next video, whatever.</p>
<p>284
00:51:15.400 --&gt; 00:51:16.869
Gabor Szabo: Thank you. Bye, bye.</p>
<p>285
00:51:16.870 --&gt; 00:51:18.830
Haki Benita: Thank you very much. Everyone. Good night.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>The Reference Model for COVID-19 attempts to explain USA data with Jacob Barhak</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2025-01-28T08:30:01Z</updated>
    <pubDate>2025-01-28T08:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/covid-19-with-jacob-barhak" />
    <id>https://python.code-maven.com/covid-19-with-jacob-barhak</id>
    <content type="html"><![CDATA[<p>Speaker: <a href="https://sites.google.com/view/jacob-barhak/home">Jacob Barhak</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ea_e6xI48Ik" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>The Reference Model for disease progression was initially a diabetes model. It used the approach of assembling models and validating them against different populations from clinical trials.</p>
<p>The model performs simulation at the individual level while modeling entire populations using the MIcro-Simulation Tool (MIST), employing High Performance Computing (HPC), and using machine learning techniques to combine models.</p>
<p>The Reference Model technology was transformed to model COVID-19 near the start of the epidemic. The model is now composed of multiple models from multiple contributors that represent different phenomena: It includes infectiousness models, transmission models, human response / behavior models, hospitalization models, mortality models, and observation models. Some of those models were calculated at different scales including cell scale, organ scale, individual scale, and population scale.</p>
<p>The Reference Model has therefore reached the achievement of being the first known multi-scale ensemble model for COVID-19. This project is ongoing and this presentation is constantly updated for each venue. To access the most recent publication please use <a href="https://www.clinicalunitmapping.com/show/COVID19_Ensemble_Latest.html">this link</a></p>
<h2 class="title is-4">Bio:</h2>
<p>Jacob Barhak is an independent Computational Disease Modeler focusing on machine comprehension of clinical data. The Reference Model for disease progression is patented technology that was self developed by Dr. Barhak. The Reference model is the most validated Diabetes model known worldwide and also the first COVID-19 multi-scale ensemble model. His efforts also include standardizing clinical data through ClinicalUnitMapping.com and he is the developer of the Micro Simulation Tool (MIST). Dr. Barhak has a diverse international background in engineering and computing science. He is active within the python community and organizes the Austin Evening of Python Coding meetup. For additional information <a href="https://sites.google.com/view/jacob-barhak/home">please visit</a></p>
<p><img src="images/jacob-barhak.jpg" alt="" /></p>
<p><a href="https://www.preprints.org/manuscript/202411.2193/v1">Lessons Learned from Modeling COVID-19: Steps to Take at the Start of the Next Pandemic[v1]</a></p>
<h2 class="title is-4">Transcript</h2>
<p>1
00:00:02.260 --&gt; 00:00:20.330
Gabor Szabo: Hello, and welcome to the Code Maven Channel on Youtube and our meeting. Thank you to everyone who has arrived to this meeting, and especially to Jacob, who is going to give the presentation. If you are unfamiliar with the Channel, then</p>
<p>2
00:00:20.500 --&gt; 00:00:48.179
Gabor Szabo: we have these live presentations, meetings with live presentations, mostly about stuff related to Python and Rust programming, and also something about Git and version control. My name is Gabor. I'm the host of this. I teach Python and Rust at corporations. And I also help companies to get started with</p>
<p>3
00:00:48.960 --&gt; 00:00:52.559
Gabor Szabo: test automation and continuous integration</p>
<p>4
00:00:52.630 --&gt; 00:00:56.399
Gabor Szabo: that sort of area.</p>
<p>5
00:00:57.730 --&gt; 00:01:04.399
Gabor Szabo: And we have this meeting so people can share their knowledge with each other.</p>
<p>6
00:01:04.540 --&gt; 00:01:11.199
Gabor Szabo: So thank you for arriving. And before I let Jacob start talking about</p>
<p>7
00:01:11.590 --&gt; 00:01:20.060
Gabor Szabo: himself and introducing himself. Please, like the video, follow the channel. I always forget to say this. So now I remembered.</p>
<p>8
00:01:20.200 --&gt; 00:01:23.540
Gabor Szabo: So thank you, and it's yours.</p>
<p>9
00:01:24.210 --&gt; 00:01:54.160
Jacob Barhak: Okay, we're going to try to make it as much of a conversation as possible. My name is Jacob Barhak. I've been developing disease models since 2006, so it's about 19 years now, a bit less than 19 years. And I have technology for disease modeling. Now, disease modeling means computational disease modeling. And I use a lot of Python. This is actually when I was introduced to Python, in 2006, and all of this project was made in Python.</p>
<p>10
00:01:54.761 --&gt; 00:02:10.820
Jacob Barhak: It requires a lot of computing power, and the idea is to be able to explain diseases. So I started with diabetes. I was actually hired by a university as a programmer to write disease modeling software.</p>
<p>11
00:02:11.030 --&gt; 00:02:38.280
Jacob Barhak: I'm still using an offshoot of the same software 19 years later. Now it's called MIST, the micro-simulation tool, and it allows you to simulate many, many individuals going through a disease. And you define what the disease is. I started with diabetes, and I actually got to the point that I have one of the most sophisticated diabetes models worldwide.</p>
<p>12
00:02:38.280 --&gt; 00:02:44.609
Jacob Barhak: and this is patented technology. And at the end, you will see I have a conflict of interest statement. Because</p>
<p>13
00:02:44.610 --&gt; 00:03:10.790
Jacob Barhak: I believe this technology is worth a lot. So whatever I'm telling you, take it with a grain of salt. This is developed technology; everything that I promise you, double-check. The nice thing about this technology is it does the double-checking for you to some degree. We'll explain it later. It uses some AI ideas and technologies to actually implement what's happening here. But it's not the AI that you're familiar with. It's a mix.</p>
<p>14
00:03:11.050 --&gt; 00:03:13.479
Jacob Barhak: And this is why it's patented. So</p>
<p>15
00:03:14.070 --&gt; 00:03:24.709
Jacob Barhak: let me explain what happens. With this technology, I started with diabetes in 2006, and in 2020</p>
<p>16
00:03:24.830 --&gt; 00:03:29.980
Jacob Barhak: Covid arrived to the US. I was in the US, and I</p>
<p>17
00:03:30.140 --&gt; 00:03:54.935
Jacob Barhak: started migrating my technology towards modeling Covid. And now I can explain Covid. But let me tell you what explain means. Oh, by the way, this presentation was given in many places you can follow up how it changed because it did change all of those things. You can download on the links at the end. You'll have a QR code. Actually, I'll show it to you now. But</p>
<p>18
00:03:55.550 --&gt; 00:04:16.769
Jacob Barhak: you will have a QR code that you can download this or actually view it. You'll need a strong machine to actually view it, because it's a huge file. It's like a quarter of a gig size to actually download and view on your browser. But it has everything, including results. And it is interactive. It's an HTML file. I'm using python technology called to actually do this.</p>
<p>19
00:04:16.769 --&gt; 00:04:29.960
Jacob Barhak: But it's less about all of this, more about disease models. Later on. You can ask me about everything else, so you can download and see how things changed, even in my perspective. But now I believe things have been stable for the last</p>
<p>20
00:04:30.390 --&gt; 00:04:38.560
Jacob Barhak: approximately 2 years, so I'm pretty sure I can explain, at least in the Us. What happened with Covid and explain means.</p>
<p>21
00:04:39.030 --&gt; 00:04:51.659
Jacob Barhak: Let's take it a step back before I show the model and explain it. Think about it: we might have more pandemics in the future. We probably will; actually, we have them all the time. We have many diseases going on.</p>
<p>22
00:04:51.810 --&gt; 00:05:10.500
Jacob Barhak: But can we really explain what's happening? I was amazed, when I started working with the medical people, at how little they know about some of the things going on, and the fact is, they are overwhelmed with data. The fact that they can remember and do something good is miraculous and</p>
<p>23
00:05:10.550 --&gt; 00:05:36.520
Jacob Barhak: speaks a lot about their profession. They are doing the best they can, but they cannot even memorize the medical papers coming out every 6 seconds, and every 6 seconds a new medical paper coming out. There's no way one doctor will remember at all. And I'm not talking about all those medical databases, huge amounts of data that are being accumulated by by bodies like the National Institutes of Health.</p>
<p>24
00:05:37.460 --&gt; 00:05:59.149
Jacob Barhak: All of this means that we now need computers to help us crunch all this and give us good results and good explanations about what's going on, because the way we are dealing with medicine now will change with the data. It's already changing. And this model and many other tools related to it will change it. So</p>
<p>25
00:05:59.300 --&gt; 00:06:05.260
Jacob Barhak: now let's go back to the model and explain how I can explain things. So</p>
<p>26
00:06:06.270 --&gt; 00:06:11.290
Jacob Barhak: the reference model for disease. Progression is kind of a statistical model.</p>
<p>27
00:06:11.690 --&gt; 00:06:31.800
Jacob Barhak: What it does, it says: each disease has states. It's a state transition model, where you can be either no COVID, or COVID-infected; you can recover or die from COVID. Notice that there is no arrow back from recovered to infected, because I'm modeling the beginning of the disease, April 2020.</p>
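<p>A state-transition model of this shape can be sketched as a tiny Markov simulation. The transition probabilities below are invented purely for illustration; the actual Reference Model infers such quantities from data rather than hard-coding them.</p>

```python
import random

random.seed(0)

# Invented illustrative probabilities -- the real model infers these from data.
# "recovered" and "died" are terminal: no arrow back to "infected".
TRANSITIONS = {
    "no_covid":  [("no_covid", 0.95), ("infected", 0.05)],
    "infected":  [("infected", 0.80), ("recovered", 0.18), ("died", 0.02)],
    "recovered": [("recovered", 1.0)],
    "died":      [("died", 1.0)],
}

def step(state):
    # Pick the next state according to the transition probabilities.
    r = random.random()
    acc = 0.0
    for nxt, p in TRANSITIONS[state]:
        acc += p
        if r < acc:
            return nxt
    return state

def simulate(days=60, state="no_covid"):
    for _ in range(days):
        state = step(state)
    return state

print(simulate())  # one simulated individual's state after 60 days
```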
<p>28
00:06:32.520 --&gt; 00:06:43.590
Jacob Barhak: Because the idea is that for the next disease we want to have a tool that will explain it to us in reasonable time, and I believe this is one of the tools that can help do that. So</p>
<p>29
00:06:43.930 --&gt; 00:06:44.650
Jacob Barhak: oh.</p>
<p>30
00:06:45.080 --&gt; 00:07:07.769
Jacob Barhak: I'm trying to extract things from the data that people accumulated from the beginning. I'm using the Covid Tracking Project data. They allow me to use it even for commercial purposes. And you can actually go and track and see, for each state in the US, how many people got infected, how many people died; they kept a pretty good record about it,</p>
<p>31
00:07:08.000 --&gt; 00:07:24.509
Jacob Barhak: and later on there are other organizations that took over, but they were, I believe, the best at the start. So I'm using that data, and now I'm trying to get a model that explains all those numbers that they report,</p>
<p>32
00:07:25.000 --&gt; 00:07:25.830
Jacob Barhak: so</p>
<p>33
00:07:26.390 --&gt; 00:07:32.730
Jacob Barhak: to do this, I assume that there are those states, and this is the beginning of the pandemic, so there is no reinfection,</p>
<p>34
00:07:33.160 --&gt; 00:07:41.250
Jacob Barhak: and I'm trying to match their numbers. How do I try to match them? Each arrow has several words above it. Each word.</p>
<p>35
00:07:41.250 --&gt; 00:07:44.160
Gabor Szabo: Wait a second, Jacob. So- so</p>
<p>36
00:07:44.760 --&gt; 00:07:54.509
Gabor Szabo: Jim is writing that the screen share is not showing, if that's what you intended. I see on the screen these boxes of No Covid.</p>
<p>37
00:07:54.510 --&gt; 00:08:08.830
Jacob Barhak: No Covid, Covid infected, possibly hospitalized. Jim, look if you have the option of actually choosing which screen you see; sometimes you can choose to see the shared screen. I can stop the share and share it again, if it's okay with you guys,</p>
<p>38
00:08:09.000 --&gt; 00:08:11.570
Jacob Barhak: or Jim, did you find the share?</p>
<p>39
00:08:11.980 --&gt; 00:08:15.290
Jim Mccormack: I'll look, Jacob. Don't let me take you out of flow. Go on, please proceed. Thank you.</p>
<p>40
00:08:15.290 --&gt; 00:08:22.830
Jacob Barhak: So look at the link I sent. You can actually bring up the presentation and follow along. I'm on the second tab, called Introduction.</p>
<p>41
00:08:23.460 --&gt; 00:08:33.630
Jacob Barhak: It's interactive on your machine. You should be able to download it if you have good Wi-Fi; just download it to your machine, and you can follow me there if you don't see the presentation.</p>
<p>42
00:08:33.730 --&gt; 00:08:34.389
Jim Mccormack: Got it.</p>
<p>43
00:08:34.390 --&gt; 00:08:35.260
Jim Mccormack: It's loading.</p>
<p>44
00:08:36.049 --&gt; 00:08:38.569
Jacob Barhak: Yeah, I know it takes a minute to load.</p>
<p>45
00:08:39.282 --&gt; 00:09:00.519
Jacob Barhak: It's huge, but it has everything encapsulated. Part of the reason is to keep it reproducible as much as possible; at the end you'll see a reproducibility section. I don't give away all the code, but I do keep track of everything that I noted. Like, you see all those boxes here; those are all the references where you can actually extract the data.</p>
<p>46
00:09:00.599 --&gt; 00:09:14.679
Jacob Barhak: And some of the links actually became defunct; I found some other links, so I can show you where I got the data from, to make sure that people can actually try to reconstruct this as much as possible, because we have a reproducibility crisis in science.</p>
<p>47
00:09:14.909 --&gt; 00:09:23.569
Jacob Barhak: Anyway, back to the boxes. On top of those boxes you'll see words like infectiousness, transmission, response, recovery, and mortality.</p>
<p>48
00:09:24.019 --&gt; 00:09:29.869
Jacob Barhak: Each one of those represents not one model, but many models.</p>
<p>49
00:09:30.449 --&gt; 00:09:42.189
Jacob Barhak: The technology that I'm using is called an ensemble model. An ensemble is like a choir in music; well, ensemble models are very similar. You don't have one model, you have many of them,</p>
<p>50
00:09:42.329 --&gt; 00:09:52.189
Jacob Barhak: so I have many models for infectiousness, many models for transmission, many models for response, many models for recovery and mortality, and hospitalization, as you'll see later.</p>
<p>51
00:09:52.369 --&gt; 00:10:21.229
Jacob Barhak: But on top of it I actually have an observer looking at all of this, saying: you know, your numbers are wrong, so you have to actually correct them. Actually, you have multiple observer models, each one seeing something different, telling you something different about those numbers. You incorporate all of those, and now you have many, many models, and you have to run them all, and it takes quite a bit of computing power. Later on I'll show you; I have a computer here still crunching data, because I'm still working on this and making sure that my numbers are okay.</p>
<p>52
00:10:21.569 --&gt; 00:10:22.439
Jacob Barhak: So</p>
<p>53
00:10:24.309 --&gt; 00:10:41.049
Jacob Barhak: all of this takes a lot of computing power, as you will see later on. This computation took about 3 years of computation on a single CPU. The ones I'm working on now will take half a year on a big 24-core machine, 32 threads. So</p>
<p>54
00:10:42.299 --&gt; 00:10:54.739
Jacob Barhak: However, when you run it in the cloud it takes many, many processors, because this takes a lot of computing power, just like AI, because it uses AI technology; we're gonna talk about it in a second.</p>
<p>55
00:10:55.605 --&gt; 00:11:18.749
Jacob Barhak: So I take all those models. I also take information from the Covid Tracking Project about what happened in each state, and you'll see some of those numbers later. I have information from the US Census about each state. States are not the same: they have different sizes, different population density, different age</p>
<p>56
00:11:19.109 --&gt; 00:11:39.639
Jacob Barhak: curves, population curves, schooling. Also, there's information about the number of interactions, and even the weather; I include the weather as part of the simulations, and later we'll ask some questions. But let me show you what the models look like and why they are the way they are. So let's</p>
<p>57
00:11:39.979 --&gt; 00:11:44.819
Jacob Barhak: explain one motivation why it is important to have a model like this.</p>
<p>58
00:11:45.769 --&gt; 00:12:10.249
Jacob Barhak: The DHS is the Department of Homeland Security in the US. During Covid it kept a document called the Master Question List about COVID-19, in a very, very organized way. They say: this is what we know, or we think we know, and this is what we want to know. So they had a master question list about Covid, the things they didn't know.</p>
<p>59
00:12:10.729 --&gt; 00:12:24.479
Jacob Barhak: On the 26th of May 2020, in that version of that table, of that paper, which evolved throughout time. By the way, you can actually download it later in the presentation and check it out.</p>
<p>60
00:12:25.089 --&gt; 00:12:37.069
Jacob Barhak: The versions changed, but it still exists somewhere; all those versions, you can still find them. The Department of Homeland Security kept a very, very meticulous record, which is very good. I give them</p>
<p>61
00:12:37.339 --&gt; 00:12:47.179
Jacob Barhak: a good grade, because this is one of the most important documents you can extract information from; they did a very good job. But</p>
<p>62
00:12:47.839 --&gt; 00:12:55.989
Jacob Barhak: even with all this, on the 26th of May they were still asking the question: what is the average infectious period during which an individual can transmit the disease?</p>
<p>63
00:12:57.259 --&gt; 00:13:07.479
Jacob Barhak: Why is it important? Think about it. You are now the government, and you have to decide whether you have a lockdown, or how much time you keep people in curfew.</p>
<p>64
00:13:08.009 --&gt; 00:13:12.569
Jacob Barhak: Or even if they're sick, how much time do you keep them from roaming around?</p>
<p>65
00:13:14.069 --&gt; 00:13:16.979
Jacob Barhak: They didn't know; they kind of admitted it.</p>
<p>66
00:13:18.669 --&gt; 00:13:19.434
Jacob Barhak: So</p>
<p>67
00:13:21.569 --&gt; 00:13:29.999
Jacob Barhak: Since the Department of Homeland Security didn't know those things, and they asked us,</p>
<p>68
00:13:30.599 --&gt; 00:13:33.819
Jacob Barhak: at this point I started looking for answers.</p>
<p>69
00:13:34.069 --&gt; 00:13:57.659
Jacob Barhak: And actually this happened later on in the pandemic, because some of those models came in like a year later, but some of them existed even before. So you can extract some information, for example from this paper from Bai Lee. Let me explain what this curve means. This is the infectiousness curve; it's relative infectiousness. It tells you how much virus you shed, meaning, how much virus your body</p>
<p>70
00:13:57.969 --&gt; 00:14:02.709
Jacob Barhak: gives away compared to your max; your max is one.</p>
<p>71
00:14:03.449 --&gt; 00:14:26.721
Jacob Barhak: So this is how much virus your body generates, and this axis is the day. At day 0, almost nothing: you just got infected, you don't generate the virus, or at least not enough to actually spread it around; the number is so low you don't see it. Then for the next 2 days you don't, then it starts growing, and then it goes away. In this case this paper actually took the information from</p>
<p>72
00:14:27.899 --&gt; 00:14:40.199
Jacob Barhak: I took this information and manipulated it a little bit, because it was not exactly like this. But some other models, they actually say: this is how we measured it. Notice how different the models are.</p>
<p>73
00:14:40.199 --&gt; 00:14:59.819
Jacob Barhak: Here's another one, actually, in this paper. They had like 5 or 6 of those models, I don't remember, but each one looked different. So I took 2 from that publication. And think about it: each person also behaves differently. When I run the simulations, I run the simulation for each individual, so each individual may be different from another,</p>
<p>74
00:14:59.869 --&gt; 00:15:06.769
Jacob Barhak: but in this situation I assume that all of them have the same infectiousness across the entire cohort, and we are looking at the average.</p>
<p>75
00:15:06.909 --&gt; 00:15:11.699
Jacob Barhak: We can do the simulations not like this, but we're looking for the average curve. So</p>
<p>76
00:15:11.909 --&gt; 00:15:36.589
Jacob Barhak: even if I say: oh, this is like that, or this person behaves like this, or like this. Some of those papers came in later in the pandemic. But still, once you have this information, or even if you don't have this information but you have assumptions from other diseases, you say: oh, you know, this disease looks like that one, and maybe take the infectiousness curve from that disease; you can plug it in.</p>
<p>77
00:15:37.329 --&gt; 00:15:56.959
Jacob Barhak: And what does the model do? It uses AI techniques. Everyone is probably familiar with the optimization technique called gradient descent. Using gradient descent, after running all of those simulations, it will find the optimal one from all of the models that you plugged in.</p>
<p>78
00:15:57.279 --&gt; 00:16:06.979
Jacob Barhak: Let me explain how it starts. At the beginning you don't know what curve is dominant, or what is actually correct. What you do is, you assume</p>
<p>79
00:16:07.409 --&gt; 00:16:23.439
Jacob Barhak: all models are the same, meaning: if 5 people come and tell you a story, you believe them all the same way, without knowing any better. So you just average whatever they're saying. This is the average of all of the models that you saw before.</p>
<p>80
00:16:24.549 --&gt; 00:16:25.659
Jacob Barhak: Now.</p>
<p>81
00:16:26.499 --&gt; 00:16:47.699
Jacob Barhak: During simulation you actually run simulations and learn: this is better, this is worse, and little by little you start optimizing. After many, many iterations, and a lot of computing power, this is basically how AI models train, using a very similar technique. So I'm using the same technique over some other medium, which is not a neural network as you know it.</p>
<p>82
00:16:47.849 --&gt; 00:16:58.359
Jacob Barhak: It's somewhat similar, but not exactly; it's a different thing, a state transition model, and it actually runs a little bit differently. I can go into details if you're interested,</p>
<p>83
00:16:58.589 --&gt; 00:17:15.439
Jacob Barhak: and then you end up with something that looks like this. This is the answer for the Department of Homeland Security. Hopefully in the future, you will see, they will use this technology and be able to extract an answer fairly quickly,</p>
<p>84
00:17:15.599 --&gt; 00:17:29.799
Jacob Barhak: and not go a long time into the pandemic without knowing. By the way, this may change with the numbers over time, and of course there's bias, variance and stuff like this, but at least at the beginning you have a basic idea of what's happening.</p>
<p>85
00:17:29.799 --&gt; 00:17:49.269
Jacob Barhak: And this is what was happening according to all the numbers and all the assumptions that you will see coming in later on. Remember, a model is not truth; it's an assumption, and what this does is take all of this ensemble and kind of put it into place: this is the most reasonable set of assumptions, the one that matches the data best.</p>
<p>86
00:17:50.259 --&gt; 00:17:52.609
Jacob Barhak: Does anyone have any question, or can I proceed?</p>
<p>87
00:17:55.109 --&gt; 00:17:56.389
Jacob Barhak: Okay, I'll proceed.</p>
<p>88
00:17:56.839 --&gt; 00:18:12.969
Jacob Barhak: Here's about transmission. I'll make it interesting for you. It's one thing being infectious, but what if we meet? April 2020: I have Covid, and let's say I meet one of you for 15 minutes.</p>
<p>89
00:18:13.269 --&gt; 00:18:25.959
Jacob Barhak: What's the chance of you getting Covid from me? Don't answer it; I'll tell you. I've asked this question many times. Many people tell me 70% for an encounter of, say, 15 minutes.</p>
<p>90
00:18:26.129 --&gt; 00:18:34.129
Jacob Barhak: And then I tell them it's lower. Then they say 50%, and I tell them it's lower. Then they get to 10%, and I tell them it's still lower.</p>
<p>91
00:18:34.409 --&gt; 00:18:39.339
Jacob Barhak: And then they end up amazed that it's only between one and 2%,</p>
<p>92
00:18:39.779 --&gt; 00:19:00.269
Jacob Barhak: because what drives Covid is not the fact that the transmission happens immediately; it's not as if the virus were so infectious that everyone would be infected in no time. What drives it is the fact that we have so many interactions amongst ourselves, a person meeting a person, and we have those every day with many, many people.</p>
<p>93
00:19:00.399 --&gt; 00:19:03.109
Jacob Barhak: So if at some point me</p>
<p>94
00:19:03.429 --&gt; 00:19:09.859
Jacob Barhak: having interaction with one person, the chance is very low. But since I meet many people for many days.</p>
<p>95
00:19:10.309 --&gt; 00:19:21.119
Jacob Barhak: this is what drives it: the chance of me transmitting at 1% over 10 days with many people is much, much higher than me with one person for 15 minutes.</p>
<p>96
00:19:21.120 --&gt; 00:19:26.539
Gabor Szabo: Doesn't it change depending on the length of the time you spend with the person?</p>
<p>97
00:19:26.540 --&gt; 00:19:42.980
Jacob Barhak: Yeah, you can argue that. Think about a 15-minute encounter as an average. If you spend twice the time with the person, then basically it's not twice the probability, but it's very close to twice the probability.</p>
<p>98
00:19:42.980 --&gt; 00:19:50.709
Gabor Szabo: But what I'm saying is that, let's say, you meet a hundred people for 1 minute versus one person for a hundred minutes.</p>
<p>99
00:19:51.370 --&gt; 00:19:58.529
Jacob Barhak: Yeah. So all of those things scale kind of differently. It's, you know, like a Bernoulli trial.</p>
<p>100
00:19:59.700 --&gt; 00:20:18.979
Jacob Barhak: It looks like a Bernoulli trial that you run many, many times: you have a chance like flipping a coin that is biased, and each period of time, let's say 15 minutes, you flip one of those coins. So according to this, you can actually calculate an approximation to the function.</p>
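The repeated coin-flip picture can be sketched in a few lines of Python. The 1.5% per-encounter figure is just the ballpark "between one and 2%" number mentioned above, not an exact parameter of the model.

```python
def transmission_chance(p_encounter, n_encounters):
    """Chance of at least one transmission over n independent
    Bernoulli trials (one biased coin flip per 15-minute encounter)."""
    return 1 - (1 - p_encounter) ** n_encounters

p = 0.015  # roughly the "between one and 2%" per-encounter chance

one = transmission_chance(p, 1)      # a single 15-minute encounter
double = transmission_chance(p, 2)   # twice the exposure: close to, but
                                     # slightly less than, twice the chance
many = transmission_chance(p, 100)   # e.g. 10 encounters a day for 10 days
```

With these numbers a single encounter stays at 1.5%, while a hundred encounters push the cumulative chance to roughly 78%, which is exactly the point about many interactions, not per-encounter infectiousness, driving the pandemic.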
<p>101
00:20:21.220 --&gt; 00:20:28.770
Jacob Barhak: So it's simple statistics that you learned at school, but now it's actually very useful in those cases.</p>
<p>102
00:20:28.770 --&gt; 00:20:52.959
Jacob Barhak: We tried in the past to do it with a Markov model. There are multiple ways to calculate it, and some people come up with different functions. It doesn't matter, really, because all of those are assumptions; all of those are incorrect to some degree. The question is which one is most plausible under all of the things that you know, and for this you need a lot of assumptions and a lot of computing power. And this is what I show.</p>
<p>103
00:20:53.480 --&gt; 00:21:01.439
Jacob Barhak: So even if it's not a Bernoulli trial, and someone else comes up with a different infectiousness function, I can plug it in and see what happens.</p>
<p>104
00:21:01.720 --&gt; 00:21:03.770
Jacob Barhak: You understand what I mean.</p>
<p>105
00:21:04.130 --&gt; 00:21:27.249
Jacob Barhak: But in this situation I took multiple functions that take into account the individual encounter, the population density of the state, some random constant just plugged in, just in case, to make some noise; sometimes it's helpful in some models, at least to rule out some things. And I also added something interesting: temperature. Here, let me ask you:</p>
<p>106
00:21:27.270 --&gt; 00:21:41.540
Jacob Barhak: in colder states or warmer states, where is the transmission higher? Think about it. If you're in New York or Michigan, versus Texas or Florida, where are the chances for you to actually transmit the disease higher?</p>
<p>107
00:21:42.560 --&gt; 00:21:54.420
Jacob Barhak: Think about it; it matters, and we'll see the answer at the end. Think about the answer to yourselves, but later on you'll get the answer from me when I show the results.</p>
<p>108
00:21:56.150 --&gt; 00:22:13.710
Jacob Barhak: So I also took into account response models. With pandemics it also matters how people behave: if you are afraid of the pandemic and you stay at home and don't go anywhere, your chances of transmitting or getting the disease are much, much lower.</p>
<p>109
00:22:14.640 --&gt; 00:22:15.860
Jacob Barhak: But then.</p>
<p>110
00:22:16.150 --&gt; 00:22:40.989
Jacob Barhak: if you run around and ignore the disease, like many people did, which is model number 3 here that says: oh, I don't care about Covid. Then you eventually get Covid, and then you're at home and don't see anyone, because you're at home; or even worse, you are not at home, and you just ignore Covid and go around, which is worse. So different people behave differently; actually, different parts of society behave differently, just like with the infectiousness curve.</p>
<p>111
00:22:41.140 --&gt; 00:22:42.859
Jacob Barhak: Each one behaves differently.</p>
<p>112
00:22:42.980 --&gt; 00:22:54.039
Jacob Barhak: So now you have different models. I took 2 models from Apple mobility data, with different variations on them. Apple mobility data says how many people looked at their phones</p>
<p>113
00:22:54.370 --&gt; 00:23:03.850
Jacob Barhak: and pressed the map button. This indicates they wanted to go somewhere. It doesn't mean they went, but this is kind of an estimate of how much mobility they had.</p>
<p>114
00:23:03.970 --&gt; 00:23:07.029
Jacob Barhak: So I use those as a base.</p>
<p>115
00:23:07.618 --&gt; 00:23:13.241
Jacob Barhak: I also used as a base what came later on from Eric Ferguson.</p>
<p>116
00:23:13.920 --&gt; 00:23:20.750
Jacob Barhak: I hope I didn't butcher the name. He's from Montclair University. He did a study of US states</p>
<p>117
00:23:21.170 --&gt; 00:23:28.679
Jacob Barhak: and knew what their shutdown orders were in the states. Each state</p>
<p>118
00:23:29.140 --&gt; 00:23:43.180
Jacob Barhak: decided differently. So now you incorporate all of this into the model and say whether, you know, non-essential shops were closed, schools, stay-at-home orders, so on and so forth,</p>
<p>119
00:23:43.360 --&gt; 00:24:08.491
Jacob Barhak: and at different levels of compliance. I entered it as a formula into the model; it's a little bit more complex, I'm just giving you the idea of what's happening here. He later on published a good version; I used an older version, and I state exactly what the differences are from the newest version. But</p>
<p>120
00:24:09.370 --&gt; 00:24:15.069
Jacob Barhak: this is how I now have different models of how the states behave,</p>
<p>121
00:24:15.180 --&gt; 00:24:17.640
Jacob Barhak: and each State behaves differently, of course.</p>
<p>122
00:24:17.970 --&gt; 00:24:28.839
Jacob Barhak: but I also have different models of those, and all of those are part of the mix of models that are playing around. Think about it: all of those models are roaming around and doing things.</p>
<p>123
00:24:29.090 --&gt; 00:24:58.840
Jacob Barhak: Then Kyoti came in, and this is fairly recent: he gave me a hospitalization model. I didn't have a hospitalization model, meaning people who are in hospital. The numbers that you count of people being infected are not a very good estimate, because you know, you don't test everyone, and so on and so forth; there are errors there. But people who ended up in the hospital, you know they were hospitalized. So if you have a hospitalization model,</p>
<p>124
00:24:58.840 --&gt; 00:25:05.610
Jacob Barhak: it kind of helps you out. Now, interestingly enough, not all states counted hospitalizations well,</p>
<p>125
00:25:05.690 --&gt; 00:25:15.500
Jacob Barhak: but when they do, I get better information. So hospitalization models are something that Kyoti gave me. And then the question is, when is the person hospitalized?</p>
<p>126
00:25:15.720 --&gt; 00:25:25.680
Jacob Barhak: At what frequency does the person get hospitalized if they get the disease? So he came up with 3 models: low probability, moderate probability, and high probability, and those depend on age,</p>
<p>127
00:25:25.990 --&gt; 00:25:28.579
Jacob Barhak: as you can see here, and also</p>
<p>128
00:25:28.770 --&gt; 00:25:37.303
Jacob Barhak: whether a person gets hospitalized early or later, meaning how much time it takes them to get hospitalized at each age,</p>
<p>129
00:25:37.880 --&gt; 00:25:38.710
Jacob Barhak: when.</p>
<p>130
00:25:38.870 --&gt; 00:25:56.340
Jacob Barhak: So again, you can run the simulation and find out that if you take the average one, it's not the best one; actually the one with the lower probability is. Hospitalization was not as high as we thought, at least according to the data and all of the other models that we found.</p>
<p>131
00:25:56.970 --&gt; 00:25:59.740
Jacob Barhak: So all of this you take into account.</p>
<p>132
00:26:00.830 --&gt; 00:26:07.310
Jacob Barhak: Finally, we have mortality models. People die of Covid; people eventually die of everything. But then,</p>
<p>133
00:26:07.480 --&gt; 00:26:19.582
Jacob Barhak: at what frequency? Again, what's the chance of dying, and when? So there's one type of model saying: we'll take this information published by the CDC. We'll take</p>
<p>134
00:26:20.410 --&gt; 00:26:31.329
Jacob Barhak: well, this is more complicated; I'll just say that it's the mortality probability and the time, and it doesn't change much.</p>
<p>135
00:26:31.930 --&gt; 00:26:32.720
Jacob Barhak: it</p>
<p>136
00:26:32.940 --&gt; 00:26:52.660
Jacob Barhak: But Filippo Castiglione actually did a model about how organs die out of cells dying. This is a multi-scale model, because he was working at the level of cells, but later on tied it all the way up to the mortality of the person. So it's different levels of scale.</p>
<p>137
00:26:52.860 --&gt; 00:27:19.839
Jacob Barhak: So this is why it's called a multi-scale model. And that model tells me, for each day after infection (infection is day 0), what's the chance of a person dying. So, for example, a 20-year-old or an infant dying at day 19 or 17, according to his model, is less than one per 1,000 if they get infected. But if you go to a 90-year-old,</p>
<p>138
00:27:20.460 --&gt; 00:27:30.929
Jacob Barhak: at day 20 it's 1%; at day 19 it's a little bit less, also about 1%. But then it drops, and it goes high. This is according to his model.</p>
<p>139
00:27:31.150 --&gt; 00:27:55.689
Jacob Barhak: So now, which one of those models is correct? You can mix them up and check it out, and this is what we do. But before we do this, we also have to account for the fact that the numbers we get are incorrect. Raise your hands: how many people here saw something claiming the numbers being shown are wrong, that someone is miscounting them? We all went through Covid.</p>
<p>140
00:27:55.810 --&gt; 00:27:59.739
Jacob Barhak: Come on. Did you ever?</p>
<p>141
00:28:00.060 --&gt; 00:28:04.030
Jim Mccormack: I definitely did. Yeah, I can't raise my hand, but I could raise my voice.</p>
<p>142
00:28:04.270 --&gt; 00:28:12.689
Jacob Barhak: Yeah. So we know that people claim the numbers are wrong. Some people think they're overestimated, some people think they're underestimated, correct?</p>
<p>143
00:28:12.960 --&gt; 00:28:27.870
Jacob Barhak: Everyone had their own opinion, and we don't know what drives those opinions, but we can suspect. It doesn't matter really; for science, we have different numbers, and we don't trust them. How do we correct them? We say, you know what,</p>
<p>144
00:28:28.190 --&gt; 00:28:31.799
Jacob Barhak: we ask the question: what if it was a different number?</p>
<p>145
00:28:32.250 --&gt; 00:28:39.130
Jacob Barhak: And what different number? For example, we know almost for sure that the number of infections that we have</p>
<p>146
00:28:41.430 --&gt; 00:28:47.100
Jacob Barhak: is miscounted. It doesn't represent the probability in the entire society</p>
<p>147
00:28:47.711 --&gt; 00:28:53.980
Jacob Barhak: well, not probability; the proportion of people actually infected in society. Because</p>
<p>148
00:28:54.300 --&gt; 00:29:10.980
Jacob Barhak: the tests always have an error. It also matters when you test the person; it matters not only the accuracy of the test, but also how you conduct the test, and</p>
<p>149
00:29:11.170 --&gt; 00:29:27.340
Jacob Barhak: what your sample population is. All of those things matter. So we pretty much assume that the numbers are underreported, the number of infected people that are reported, because there are those who never tested; therefore their numbers didn't appear.</p>
<p>150
00:29:27.420 --&gt; 00:29:42.970
Jacob Barhak: So now, by how much do we multiply them? Well, some people multiply by 5; this seems to be like a running number that everyone multiplies by in epidemiology. I claimed: okay, let's try 20. If it's 5, let's also try 20. And</p>
<p>151
00:29:43.420 --&gt; 00:29:49.230
Jacob Barhak: Lucas, who gave me this model, actually gave me an infections model; Lucas Botcher. He</p>
<p>152
00:29:49.770 --&gt; 00:30:03.270
Jacob Barhak: actually looked at it. He says that it's about 7 to 15; you have to multiply the infections by 7 to 15. And then there's another model that is more complicated to explain; we'll leave it, it doesn't matter for now,</p>
<p>153
00:30:03.480 --&gt; 00:30:04.600
Jacob Barhak: and then</p>
<p>154
00:30:06.180 --&gt; 00:30:16.249
Jacob Barhak: The mortality is the same thing. Some people who died, you know, people who died in a car crash, were listed as Covid; at least this is what was reported in some newspapers.</p>
<p>155
00:30:17.430 --&gt; 00:30:27.399
Jacob Barhak: And then vice versa: some people who died of Covid maybe were written down as dying of something else because of a complication; we never know. So</p>
<p>156
00:30:27.770 --&gt; 00:30:36.420
Jacob Barhak: you can actually model this, and this is what Lucas Botcher did. He did it per state, and he gave me a bunch of numbers per state.</p>
<p>157
00:30:36.580 --&gt; 00:30:49.569
Jacob Barhak: So now I'm running all of those models to check whatever is being told. So now we have 2 numbers: the true number that the model knows, of how many people are infected, and the observed numbers, which</p>
<p>158
00:30:50.160 --&gt; 00:30:58.770
Jacob Barhak: take all of those and adjust them according to the real number; so the observed number will be different.</p>
<p>159
00:30:59.020 --&gt; 00:31:20.830
Jacob Barhak: So now let's look at the results. This takes a lot of computing time to actually do: it's about 3 years of computation on one CPU core. I use a 24-core machine, so it takes about 6 weeks to run the simulation that you see. I'm now running a much, much bigger simulation; I might show you the screen while it's happening later on.</p>
<p>160
00:31:22.470 --&gt; 00:31:49.910
Jacob Barhak: Each state is represented by 10,000 individuals, and those 10,000 interact, kind of, with each other, and also with all of the equations that I showed you. And each equation comes in with a different weight. What I do is run all the simulations and then test how well or how badly the numbers match the reported numbers. So let's look at this.</p>
<p>161
00:31:50.140 --&gt; 00:31:56.009
Jacob Barhak: It's huge; I cannot even show you the real results. This is a cut-down version,</p>
<p>162
00:31:56.860 --&gt; 00:32:03.119
Jacob Barhak: because otherwise the file sizes become enormous at some point. But let me explain what you're seeing.</p>
<p>163
00:32:03.240 --&gt; 00:32:09.579
Jacob Barhak: This is the population panel. You see here, for each state,</p>
<p>164
00:32:10.720 --&gt; 00:32:27.540
Jacob Barhak: multiple of those circles; each circle represents one day in one simulation, and I start simulations again at different times. Because, remember, you have the timeline running,</p>
<p>165
00:32:28.190 --&gt; 00:32:56.680
Jacob Barhak: and I start the simulations once at day 1 and then once at day 5, and then check whether day 10 in one, which means day 5 in the other, is the same. I include all of those together, because if I start all the simulations at the same time and some of the numbers are wrong, then I have a problem. So I have to start the simulation at different times in the pandemic, in different windows. Each time I run a window of 21 days,</p>
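The staggered windows can be sketched like this; the 21-day length comes from the talk, while the exact stagger and date range here are made up for illustration:

```python
WINDOW_DAYS = 21

def simulation_windows(first_start, last_start, stagger=5):
    """Return (start_day, end_day) pairs for staggered 21-day runs.

    Overlapping windows mean the same calendar day is simulated in
    several runs, so one wrongly reported day cannot dominate the fit.
    """
    return [(start, start + WINDOW_DAYS)
            for start in range(first_start, last_start + 1, stagger)]

# Runs starting at day 1, 6, 11, ... each cover 21 days, so day 10 of
# one run overlaps day 5 of the next, as described above.
```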
<p>166
00:32:56.780 --&gt; 00:33:00.859
Jacob Barhak: and then I check after 21 days how well it matches the results.</p>
<p>167
00:33:01.300 --&gt; 00:33:07.789
Jacob Barhak: I figured out that if I don't do those windows, then</p>
<p>168
00:33:07.930 --&gt; 00:33:14.660
Jacob Barhak: the results are not really good. So those windows really help stabilize the results, because all the numbers become</p>
<p>169
00:33:15.080 --&gt; 00:33:23.000
Jacob Barhak: much less sensitive to a wrong number, to some errors that appear. So,</p>
<p>170
00:33:23.750 --&gt; 00:33:37.199
Jacob Barhak: What happens here is that one circle represents, for example — take this circle: this is Kentucky, and code 45 means that it starts 45 days after April 1st.</p>
<p>171
00:33:37.300 --&gt; 00:34:03.680
Jacob Barhak: This is where the simulation starts, and then it runs for 21 days. After 21 days you can look at the results here; it gives you the average age and so on. But look at the numbers labeled observed infected: you will see a number before the slash and a number after the slash. Same thing for observed deaths, before the slash and after the slash. The number before is what the model tells you.</p>
<p>172
00:34:03.980 --&gt; 00:34:14.260
Jacob Barhak: The number after is the actual number, as reported by the COVID Tracking Project — after, of course, being normalized to 10,000 people per state.</p>
<p>173
00:34:14.650 --&gt; 00:34:20.170
Jacob Barhak: So this is out of 10,000 people, and you see the model is way off — like by a factor of two here.</p>
<p>174
00:34:20.570 --&gt; 00:34:24.529
Jacob Barhak: Now, the height of that circle</p>
<p>175
00:34:24.750 --&gt; 00:34:47.079
Jacob Barhak: is the error. It's called the fitness score. That takes into account the differences between the infectious, the observed infection model and the Re and the observed results and the death model and the observed results and the hospitalization model observed results. They're all bundled together.</p>
<p>176
00:34:47.300 --&gt; 00:34:56.389
Jacob Barhak: And then this is optimized using gradient descent, an AI technique that people are familiar with; it is the basis of all our neural networks.</p>
<p>177
00:34:56.739 --&gt; 00:34:57.690
Jacob Barhak: So</p>
<p>178
00:34:58.200 --&gt; 00:35:09.280
Jacob Barhak: Now I'm taking all of this and starting to optimize. Notice, the height of the circle is the error, and I'm trying to drive it down; ideally I want everything to be around 0.</p>
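<p>The weight-optimization idea — gradient descent driving the ensemble's fitness score toward 0, so that badly-fitting models lose weight — can be illustrated with a toy sketch. This is not the actual engine; the predictions, learning rate, and weights below are made up:</p>

```python
def fitness(weights, predictions, observed):
    """Fitness score: squared error of the weighted ensemble vs. observation."""
    blended = sum(w * p for w, p in zip(weights, predictions))
    return (blended - observed) ** 2

def descend(weights, predictions, observed, rate=1e-4, steps=200):
    """Plain gradient descent on the ensemble weights."""
    for _ in range(steps):
        blended = sum(w * p for w, p in zip(weights, predictions))
        # Analytic gradient of the squared error w.r.t. each weight.
        grads = [2 * (blended - observed) * p for p in predictions]
        # Step downhill; keep weights non-negative.
        weights = [max(0.0, w - rate * g) for w, g in zip(weights, grads)]
    return weights

# Three candidate models predicting infections per 10,000; one is way off.
predictions = [12.0, 50.0, 14.0]
observed = 13.0
weights = descend([1 / 3] * 3, predictions, observed)
print(weights)
```

After descent the blend matches the observation and the badly-off model carries the least weight — the toy analogue of circles dropping toward 0 and some models "disappearing".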
<p>179
00:35:09.810 --&gt; 00:35:18.669
Jacob Barhak: And what am I actually optimizing? Let's explain what happens here. You see all those 5 blue ones — those are all infectiousness models.</p>
<p>180
00:35:18.890 --&gt; 00:35:26.039
Jacob Barhak: those purple ones are transmission models. The green ones are the behavioral models.</p>
<p>181
00:35:26.060 --&gt; 00:35:43.350
Jacob Barhak: those the reddish or brownish ones, and the yellow ones are the hospitalization models. You remember, we have time and probability, and those are mortality models, and finally, the purple ones are mortality observer models. All of those are being a hospital</p>
<p>182
00:35:43.350 --&gt; 00:35:58.570
Jacob Barhak: optimize at the same time. So now I'm looking, I'm I'm tweaking them, and I'm running ready in the center. There are variations. And then, little by little, you said, even like after 3 iterations, see, one of the transmission models totally disappears. We'll tell you which one in a second.</p>
<p>183
00:36:00.610 --&gt; 00:36:21.890
Jacob Barhak: And then it continues and continues, and you see here some more dominant mortality models — so some of the mortality models were not that good. And it continues and continues. Now you can build those curves I showed you at the beginning; you can actually figure out what the transmission was, how people behaved. You'll see the Apple mobility data disappeared completely.</p>
<p>184
00:36:22.656 --&gt; 00:36:29.049
Jacob Barhak: within the transmission models, the model that disappeared in this one became dominant. Those are the ones with temperature.</p>
<p>185
00:36:29.600 --&gt; 00:36:35.140
Jacob Barhak: The one that says that colder states transmit more</p>
<p>186
00:36:35.590 --&gt; 00:36:41.780
Jacob Barhak: is the one that is dominant, meaning. If you're in a hot state you'll be. You're better off than in the cold state.</p>
<p>187
00:36:42.110 --&gt; 00:36:44.770
Jacob Barhak: because this almost disappeared completely.</p>
<p>188
00:36:46.390 --&gt; 00:36:52.820
Jacob Barhak: Here in the infectiousness models you see that the dominant models are the longer ones, not the shorter ones.</p>
<p>189
00:36:53.680 --&gt; 00:37:02.510
Jacob Barhak: And finally, if you look at the mortality models, the one that Felipe Castiglio gave me is the dominant one,</p>
<p>190
00:37:02.750 --&gt; 00:37:07.500
Jacob Barhak: meaning this is these models actually better if you look at it over time</p>
<p>191
00:37:08.365 --&gt; 00:37:30.270
Jacob Barhak: And the observer models tell you that some of the models don't make sense — like, don't multiply by 20; it's somewhere between 1 and 5 or 7, approximately. The people who said that you multiply the reported number of infections by about 5 to get the correct number were generally correct.</p>
<p>192
00:37:30.440 --&gt; 00:37:34.860
Jacob Barhak: We can actually calculate the exact numbers, but this doesn't matter, because</p>
<p>193
00:37:35.070 --&gt; 00:37:43.209
Jacob Barhak: it's it's it's enough to know that it's what's wrong and what's not. And now we can answer questions about the pandemic.</p>
<p>194
00:37:43.530 --&gt; 00:37:53.749
Jacob Barhak: And I've written all those things here. But let's talk a little bit about conclusions. I'll just conclude everything, in case you might have questions,</p>
<p>195
00:37:54.540 --&gt; 00:37:55.809
Jacob Barhak: and I don't want to go</p>
<p>196
00:37:56.190 --&gt; 00:38:00.710
Jacob Barhak: too much overtime. I want to keep it short. Otherwise it's me talking. I want to hear you.</p>
<p>197
00:38:01.224 --&gt; 00:38:06.839
Jacob Barhak: So the idea is that this model can help the government in the next pandemic.</p>
<p>198
00:38:08.740 --&gt; 00:38:09.600
Jacob Barhak: Now</p>
<p>199
00:38:09.710 --&gt; 00:38:20.649
Jacob Barhak: I'm telling you this, but I am biased, because I developed it. This has been in development since about 2012. I invested all my time, effort, and resources into this —</p>
<p>200
00:38:21.990 --&gt; 00:38:30.209
Jacob Barhak: quite a bit. I've been doing this for many years on my own. I'm a sole proprietor now, meaning I'm a company of one person in the us.</p>
<p>201
00:38:31.730 --&gt; 00:38:38.319
Jacob Barhak: It's a form of explainable artificial intelligence, because I can explain to you things as you saw.</p>
<p>202
00:38:38.460 --&gt; 00:38:43.289
Jacob Barhak: So it's AI, but the explainable type, as I'm showing you.</p>
<p>203
00:38:43.380 --&gt; 00:39:08.519
Jacob Barhak: it's sometime now. The question is, how how good it is. So I can tell you for sure. It is difficult to explain phenomena like COVID-19, because there are many, many parameters, and the question is, how much time you know, because the next pandemic, if someone comes to me and ask me how good this tool will be, I tell them. Well, you have to have at least 3 weeks of data after you having some sort of</p>
<p>204
00:39:08.560 --&gt; 00:39:32.929
Jacob Barhak: infection going on, I started modeling April 2020. So you need at least 4 weeks of data, but not this is actually not true, because 3 weeks of data will give you initially. When I did this it will give you different results, because you don't have enough numbers. You have to forward for at least several months and then do the windows I'm talking about, and then you and then the numbers started stabilizing it.</p>
<p>205
00:39:33.200 --&gt; 00:40:00.399
Jacob Barhak: I'm still running a big simulation here to make sure those numbers are correct, because I'm doing Monte Carlo simulations when I flip all those coins. And I'm making sure that I throw enough computing power at it to actually make it useful, and that I didn't make any mistakes. Every once in a while I find something that was wrong that I need to correct; this is why there are multiple versions. But in the future you'll need at least a few months of data,</p>
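<p>The "flipping coins" Monte Carlo idea — each of the 10,000 individuals gets a random draw each day, and repeated seeded runs are averaged so the numbers stabilize — might look roughly like this toy sketch. The parameters are illustrative only, not the real model's:</p>

```python
import random

def run_once(population=10_000, p_infection=0.002, days=21, rng=None):
    """One Monte Carlo run: flip a coin for every susceptible person each day."""
    rng = rng or random.Random()
    susceptible, infected = population, 0
    for _ in range(days):
        new = sum(1 for _ in range(susceptible) if rng.random() < p_infection)
        infected += new
        susceptible -= new
    return infected

def run_many(repetitions=5, seed=42):
    """Average several seeded repetitions so the numbers stabilize."""
    master = random.Random(seed)
    runs = [run_once(rng=random.Random(master.getrandbits(32)))
            for _ in range(repetitions)]
    return sum(runs) / len(runs)

print(run_many())
```

More repetitions shrink the run-to-run noise, which is why throwing more computing power at it makes the result more trustworthy.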
<p>206
00:40:00.940 --&gt; 00:40:11.729
Jacob Barhak: plus you have to one of those windows, but maybe you'll get initial results after a month or 2, and and at least some sense you'll get, and then later on you'll get</p>
<p>207
00:40:11.860 --&gt; 00:40:21.439
Jacob Barhak: you. You go with it, so the Government will not be completely clueless like it was in the pandemic, because now we find out how clueless they were.</p>
<p>208
00:40:23.490 --&gt; 00:40:34.629
Jacob Barhak: Now, I can tell you that the peak of average infectiousness is at about day 5 from infection — this is for COVID — and the transmission rate is about 2% per encounter, a little bit less.</p>
<p>209
00:40:34.750 --&gt; 00:40:38.140
Jacob Barhak: warm weather seems to reduce transmission. Now.</p>
<p>210
00:40:38.520 --&gt; 00:40:52.450
Jacob Barhak: this is something important. Today we published the paper lessons learned from COVID-19 for modeling COVID-19, and steps to take in the next pandemic here. I'll show you the paper it will load. It's now in the preprint.</p>
<p>211
00:40:53.000 --&gt; 00:41:06.579
Jacob Barhak: We have multiple collaborators, and they actually came from different perspectives. I did the reference model, but they did other models and figured out how to do other things better,</p>
<p>212
00:41:06.710 --&gt; 00:41:24.230
Jacob Barhak: and we wrote down the paper and says, what how to do, modeling better, how to spread information better and get it correctly, how to validate the information properly, and also we have some recommendation about infrastructure and education that will help.</p>
<p>213
00:41:24.400 --&gt; 00:41:33.289
Jacob Barhak: So if you're interested, please go and read the paper. It's in preprint; you can get it by following this link, number 53 here.</p>
<p>214
00:41:34.960 --&gt; 00:41:59.159
Jacob Barhak: All of what I showed you here can be reproduced to some degree. I'm showing exactly what data I used, so I can trace back, if someone ever asks me about the presentation, where I got the numbers from. The code that actually created the presentation you can find on GitHub — I actually released it, but not all of the data; some things are proprietary. I have a conflict of interest statement here,</p>
<p>215
00:41:59.160 --&gt; 00:42:05.399
Jacob Barhak: because I do intend to get money out of it. I have 2 patents us patents on this technology.</p>
<p>216
00:42:05.871 --&gt; 00:42:12.049
Jacob Barhak: I'm now licensing them. If you're interested, or know someone who's interested, please do connect us.</p>
<p>217
00:42:12.280 --&gt; 00:42:32.328
Jacob Barhak: And I have many, many people and organizations to thank. People gave me all sorts of help — presentation technology, ideas — and helped me find resources and connect with people who could help.</p>
<p>218
00:42:32.930 --&gt; 00:42:43.334
Jacob Barhak: People hosted my computer, published the video — you can actually see this; this is the video that's embedded, you can look at it later. So it was published in Siphode.</p>
<p>219
00:42:44.108 --&gt; 00:42:59.230
Jacob Barhak: People contributed models — I'm showing some of the other work they did here. People contributed money to run simulations in the cloud; I actually ran several simulations in the cloud which, instead of several weeks or months, ran in 2 days.</p>
<p>220
00:42:59.850 --&gt; 00:43:21.759
Jacob Barhak: You don't always have money for that. So Rescale gave me money for Azure and Amazon, and MIDAS gave me some money; I ran a simulation on Google Cloud with them — they gave it through some grant. And many, many people I have to thank for various ideas and things. So thank you all, and I'm open for questions.</p>
<p>221
00:43:22.170 --&gt; 00:43:23.910
Jacob Barhak: underneath this.</p>
<p>222
00:43:28.300 --&gt; 00:43:33.410
Jim Mccormack: So, Jacob, have you been able to use any of it on like bird flu, or any other pandemics that</p>
<p>223
00:43:34.160 --&gt; 00:43:37.540
Jim Mccormack: that are starting or rumored to be the next pandemic.</p>
<p>224
00:43:37.780 --&gt; 00:43:50.169
Jacob Barhak: Currently I'm focusing on COVID, because, believe it or not, this technology is still in the development phase. You show me any technology that you invested twenty-many years in —</p>
<p>225
00:43:50.360 --&gt; 00:43:53.360
Jacob Barhak: is it good enough after twenty-many years?</p>
<p>226
00:43:53.900 --&gt; 00:43:58.020
Jacob Barhak: It's about 20 years of development.</p>
<p>227
00:43:59.560 --&gt; 00:44:07.305
Jacob Barhak: You tell me. So I'm still making sure, I'm cutting all the bits and pieces.</p>
<p>228
00:44:07.920 --&gt; 00:44:20.389
Jacob Barhak: so for this such technologies, you need many more resources to actually meaning I'm talking. I'm not talking about, you know, some game that blows up that people use, or something that is fairly tested.</p>
<p>229
00:44:20.980 --&gt; 00:44:40.410
Jacob Barhak: Sometimes those are not really tested; they just blow up virally. Here, to be sure that this technology works, you actually have to invest a lot of time. Now, the big problem with all of the data — which is a different project that I'm making — is the fact that you cannot get medical data.</p>
<p>230
00:44:40.620 --&gt; 00:44:57.669
Jacob Barhak: One of the advantages of this project is that it can actually merge data from multiple sources. This is practically not allowed in the medical world, because if you have population A and population B in the medical world. They are not allowed to merge the data between those 2.</p>
<p>231
00:44:58.220 --&gt; 00:45:14.120
Jacob Barhak: No — because it's patient data, merging the individual data is not allowed. But you're allowed to merge the models, and this is what I'm doing. This is why this technology is important, and not only for pandemics. I'm doing it on the pandemic because there I have good data.</p>
<p>232
00:45:14.220 --&gt; 00:45:37.740
Jacob Barhak: The other model I have is the diabetes model. Today, as far as I know, I have the most validated diabetes model worldwide, because I tested it with more populations than anyone else. How did I do it? I connected to clinicaltrials.gov and got the information from there. But the thing is, even the data on clinicaltrials.gov is not that good. Here, I'll show you — that's another project that I'm working on.</p>
<p>233
00:45:38.060 --&gt; 00:45:49.700
Jacob Barhak: Even getting the data out of those trials is nearly impossible, because the units of measure are messed up. Even if you do it correctly — I'm going to show you just one thing.</p>
<p>234
00:45:50.650 --&gt; 00:45:57.069
Jacob Barhak: It will take a minute to load. This is a website, it's actually active, you can check it out. But, like, here: HbA1c is a measure of diabetes.</p>
<p>235
00:45:57.900 --&gt; 00:46:05.770
Jacob Barhak: So see how many ways people write the HbA1c units of measure. A computer cannot understand it.</p>
<p>236
00:46:06.330 --&gt; 00:46:18.110
Jacob Barhak: So you need AI to tell it how it's supposed to be. That's a different project I'm doing, which is a spin-off of this project. And actually, there are some claims in the patents that relate to this project as well. So,</p>
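<p>The unit-of-measure problem can be illustrated with a naive sketch: many spellings of the HbA1c percentage unit mapped to one canonical form. The variant strings below are made-up examples, not quotes from clinicaltrials.gov, and as the talk says, a real solution needs AI rather than a hand-written regex:</p>

```python
import re

def normalize_hba1c_unit(raw):
    """Map free-text spellings of the HbA1c percentage unit to one form."""
    text = re.sub(r"[\s\.\-,]", "", raw).lower()  # drop spaces and punctuation
    pattern = r"(%|percent(age)?)?(of)?(hba1c|a1c|glycatedhemoglobin)?(%|percent(age)?)?"
    if text and re.fullmatch(pattern, text):
        return "HbA1c %"
    return None  # unrecognized: needs human (or AI) review

for raw in ["%", "percent", "HbA1c %", "percentage of HbA1c", "A1C percent"]:
    print(raw, "->", normalize_hba1c_unit(raw))
```

Every new trial brings new spellings, so the lookup keeps growing — which is the argument for a learned normalizer instead.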
<p>237
00:46:18.290 --&gt; 00:46:29.200
Jacob Barhak: eventually, being able to get the disease models correctly, you need correct data. You messing around with the data is the biggest problem.</p>
<p>238
00:46:29.550 --&gt; 00:46:34.450
Jacob Barhak: So you're saying bird flu because everyone says bird flu. But do you have good data about bird flu?</p>
<p>239
00:46:35.380 --&gt; 00:46:39.609
Jacob Barhak: Well, when we started with COVID, we didn't have good data.</p>
<p>240
00:46:39.720 --&gt; 00:46:54.809
Jacob Barhak: And this is why we had this project running, and this is where the recommendations come in: how to get good data in the future, and how to do it so that models can use it and actually get the results I got. It took me about 5 years to get to something stable that I'm showing you now.</p>
<p>241
00:46:55.100 --&gt; 00:47:16.629
Jacob Barhak: And I started at the beginning of the pandemic. In the next pandemic, if it takes one year it will be better, and then it will be 3 months, or 2 months, or 2 weeks — it will get better and better. But to get to that point we need an entire well-oiled system that gives us all the things we need: all the correct data and all the correct models, and so on and so forth.</p>
<p>242
00:47:16.630 --&gt; 00:47:25.530
Jacob Barhak: This is still a lot of work. The big systems are not yet set up for it, and hopefully this paper will be helpful in this regard.</p>
<p>243
00:47:25.670 --&gt; 00:47:30.479
Jacob Barhak: And think about how much money was spent on COVID, and how many models are out there.</p>
<p>244
00:47:30.890 --&gt; 00:47:51.490
Jacob Barhak: Think about some big machine in the future that crunches all of those assumptions that people plug in and tells them: oh, this assumption is probably incorrect, it doesn't match this data or that data. This is what my technology does. We do have the computing power today, but we need the software infrastructure — and all those many years of work that I only started.</p>
<p>245
00:47:51.660 --&gt; 00:47:53.450
Jacob Barhak: Did I answer your question, Jim?</p>
<p>246
00:47:53.710 --&gt; 00:48:09.379
Jim Mccormack: Yeah, and very well. So, Jacob, if I restate: it may not be directly translatable to bird flu, but the lessons learned and the prep work will get us to those answers faster, using this example and this work.</p>
<p>247
00:48:09.530 --&gt; 00:48:28.879
Jacob Barhak: I didn't try it on bird flu. If you give me data on bird flu, I can try it. But then I'll tell you: oh, this is missing, and this is missing, and I need all those assumptions. And then you'll start finding all those researchers — even finding researchers to collaborate, to give you all those models, also takes time. All of this has to be centralized, in a way that</p>
<p>248
00:48:29.370 --&gt; 00:48:37.410
Jacob Barhak: the system was during Covid. Everything was kind of a mess. I wasn't part of that mess. I was trying to find things. I couldn't find them.</p>
<p>249
00:48:37.900 --&gt; 00:48:45.199
Jacob Barhak: I was stressed, because I had been working on this for, at that point, like 15 years, and I still was</p>
<p>250
00:48:45.860 --&gt; 00:49:01.209
Jacob Barhak: feeling that I don't need. I have everything I need, like all the pieces and pieces now, and we're trying to normalize it and give more accommodations how to do it better in the future. The idea is that someone who actually is interested.</p>
<p>251
00:49:01.210 --&gt; 00:49:20.360
Jacob Barhak: We'll look at it, learn from it, and then start training groups that will do those things. I have a friend who actually has some good ideas about his name is John Rice, and he actually was the instigator of this paper, saying, You know, what what did you learn from all this work that you did about Covid. So we organized it all in a way that in the next pandemic</p>
<p>252
00:49:20.530 --&gt; 00:49:29.019
Jacob Barhak: there won't be such a problem. And we're now propagating this paper. If you're interested in helping take this paper, send some of your friends.</p>
<p>253
00:49:29.910 --&gt; 00:49:32.570
Gabor Szabo: There's another question I see in the chat.</p>
<p>254
00:49:35.170 --&gt; 00:49:36.230
Gabor Szabo: Can I read it out.</p>
<p>255
00:49:37.783 --&gt; 00:49:38.549
Jacob Barhak: Yeah. Please.</p>
<p>256
00:49:38.730 --&gt; 00:49:44.720
Gabor Szabo: For type 1 diabetes modeling — isn't that a genetic disease? Autoimmune?</p>
<p>257
00:49:45.690 --&gt; 00:49:46.115
Jacob Barhak: Well.</p>
<p>258
00:49:47.190 --&gt; 00:49:55.940
Jacob Barhak: type 1 diabetes is different from type 2; they have different mechanisms. I'm not a medical doctor, to go into those.</p>
<p>259
00:49:56.540 --&gt; 00:50:05.610
Jacob Barhak: It was explained to me many times while I was doing type 2 diabetes; I was working with a team of worldwide experts in diabetes.</p>
<p>260
00:50:06.670 --&gt; 00:50:12.190
Jacob Barhak: I'm less concerned about the type of disease or what it is.</p>
<p>261
00:50:12.350 --&gt; 00:50:24.139
Jacob Barhak: Diseases for me are state transition models, where you jump from one state to another, and there's a probability of moving there, and the probability depends on all sorts of parameters.</p>
<p>262
00:50:24.380 --&gt; 00:50:25.230
Jacob Barhak: So</p>
<p>263
00:50:25.480 --&gt; 00:50:42.630
Jacob Barhak: the source of the disease or the cures, I don't care much. I just want to be able to explain it, explain it. And how do I explain it? If I have a model that says A, and then model, says BI want to know which one of those contributes more to the numbers I see at the end.</p>
<p>264
00:50:42.810 --&gt; 00:50:52.719
Jacob Barhak: The way I look at diseases, it's all numbers; the people who understand all of the elements are the ones making the models.</p>
<p>265
00:50:53.240 --&gt; 00:50:56.230
Jacob Barhak: So, does that answer your question?</p>
<p>266
00:51:00.390 --&gt; 00:51:07.479
Jacob Barhak: Okay, so this is why it is. And I'll try to send you the link for that.</p>
<p>267
00:51:08.800 --&gt; 00:51:09.860
Jacob Barhak: Here we go.</p>
<p>268
00:51:10.650 --&gt; 00:51:31.700
Jacob Barhak: This is the link for that paper, the lessons-learned paper. So if you know people who are interested, please propagate it — this is important. Hopefully some governments or organizations will adopt it, and the next pandemic will have less mess than we had in this one. By the way, when the pandemic started, everyone was doing COVID —</p>
<p>269
00:51:31.950 --&gt; 00:51:45.610
Jacob Barhak: really. Every department, every institution — financial institutions were running COVID models, computational institutions had COVID models. Everyone</p>
<p>270
00:51:45.860 --&gt; 00:51:53.350
Jacob Barhak: had Covid model. Now I'm i i even talking to people who model Covid is kind of hard</p>
<p>271
00:51:53.520 --&gt; 00:52:16.299
Jacob Barhak: because they all stopped doing it; it's less interesting. But for me it's very interesting, because I really am dedicated to it — this is my life's work. I've been working on this for almost 20 years, and I want this to continue and be done properly. So this is why I'm giving this talk. And Gabor, thank you for having me.</p>
<p>272
00:52:17.117 --&gt; 00:52:26.700
Gabor Szabo: Thank you for giving this talk, this presentation. Anyone — any more questions before we</p>
<p>273
00:52:27.180 --&gt; 00:52:29.009
Gabor Szabo: we close the video.</p>
<p>274
00:52:31.430 --&gt; 00:52:40.860
Jacob Barhak: If you have Python questions on how I did this or that with Python, I can answer those too. I'm running many, many things. Actually, maybe it's a good time to show you.</p>
<p>275
00:52:41.020 --&gt; 00:52:42.370
Jacob Barhak: You see</p>
<p>276
00:52:43.590 --&gt; 00:53:03.189
Jacob Barhak: this thing — hopefully I won't disconnect everything. This is the screen of a computer. Behind it is a 24-core machine, a very good processor; it runs as fast as older supercomputers. 32 cores, 32 threads. And this simulation now is at</p>
<p>277
00:53:04.120 --&gt; 00:53:04.890
Jacob Barhak: team.</p>
<p>278
00:53:04.890 --&gt; 00:53:07.520
Gabor Szabo: If you stop the screen sharing you'll see better.</p>
<p>279
00:53:08.430 --&gt; 00:53:10.879
Jacob Barhak: Oh, oh, okay, I'll stop screen share.</p>
<p>280
00:53:11.850 --&gt; 00:53:12.820
Jacob Barhak: Second.</p>
<p>281
00:53:16.910 --&gt; 00:53:19.340
Jacob Barhak: How do I stop the screen share?</p>
<p>282
00:53:21.328 --&gt; 00:53:25.809
Jacob Barhak: I think it shows me screen share. How do I stop it? It says only "share".</p>
<p>283
00:53:26.120 --&gt; 00:53:27.730
Jacob Barhak: Did I stop the share?</p>
<p>284
00:53:28.380 --&gt; 00:53:28.910
Jacob Barhak: No.</p>
<p>285
00:53:28.910 --&gt; 00:53:29.850
Gabor Szabo: Oh, not yet.</p>
<p>286
00:53:30.770 --&gt; 00:53:31.900
Jacob Barhak: Give me a second.</p>
<p>287
00:53:35.300 --&gt; 00:53:41.260
Jacob Barhak: I'm not sure how to stop the share in this mode. Oh — okay, thank you.</p>
<p>288
00:53:41.460 --&gt; 00:54:08.409
Jacob Barhak: You see, this screen is actually a computer that runs a simulation — the same simulation. You see, now it's around here: this is iteration 17. It should get to about 40 to get something stable; this is my current baseline. I've seen the simulation at 40, but this one is much, much bigger. Here I start the simulation on each day, and I run 5 repetitions</p>
<p>289
00:54:08.700 --&gt; 00:54:15.240
Jacob Barhak: for all of this. I'm actually running about 2 months of days, for the information I have from April to June,</p>
<p>290
00:54:15.260 --&gt; 00:54:43.210
Jacob Barhak: and each time I start on a different day, run for 21 days, and check the numbers for 5 simulations for each state, and then I continue doing this. This will take me about half a year. I started a few months ago, and I had all sorts of power problems and computer problems — I actually burned computers on this, I kid you not. I have multiple computers dead from running all those simulations for many, many years.</p>
<p>291
00:54:43.340 --&gt; 00:54:52.760
Jacob Barhak: So I started with small clusters. I created clusters — I use Dask to create clusters. It still runs with Dask. You probably cannot</p>
<p>292
00:54:53.020 --&gt; 00:54:54.450
Jacob Barhak: see the best. This is.</p>
<p>293
00:54:56.670 --&gt; 00:54:59.559
Gabor Szabo: Yeah, I can't. We can't really see that. No? Well.</p>
<p>294
00:55:00.037 --&gt; 00:55:04.800
Gabor Szabo: Apologies — I cannot make it much closer; it's very bright.</p>
<p>295
00:55:05.370 --&gt; 00:55:09.679
Jacob Barhak: I apologize. This is the best I can do, anyway.</p>
<p>296
00:55:10.630 --&gt; 00:55:16.720
Gabor Szabo: Oh, thank you very much again, and thank you everyone for being here, and if you're watching:</p>
<p>297
00:55:16.830 --&gt; 00:55:27.079
Gabor Szabo: you shared the links so people can find you; they will be under the video. There will be a link for all these things.</p>
<p>298
00:55:28.670 --&gt; 00:55:30.450
Gabor Szabo: Like the video</p>
<p>299
00:55:30.740 --&gt; 00:55:38.570
Gabor Szabo: share, follow the channel, share the video and talk to Jacob. If you were interested in discussing this topic.</p>
<p>300
00:55:38.870 --&gt; 00:55:40.539
Jacob Barhak: Please. Thank you very much.</p>
<p>301
00:55:40.540 --&gt; 00:55:41.120
Jim Mccormack: Thank you.</p>
<p>302
00:55:41.120 --&gt; 00:55:41.760
Gabor Szabo: Right.</p>
<p>303
00:55:42.730 --&gt; 00:55:43.610
Jacob Barhak: Bye, bye.</p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>logger.info(f&quot;Don&#39;t Give all your {secrets} away&quot;) with Tamar Galer</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2024-12-12T16:30:01Z</updated>
    <pubDate>2024-12-12T16:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/logger-info-with-tamar-galer" />
    <id>https://python.code-maven.com/logger-info-with-tamar-galer</id>
    <content type="html"><![CDATA[<p>Explore the transition from developer to security researcher, addressing log safety in applications. Learn common mistakes, practical Python solutions, and empower developers to protect against data exposure.</p>
<ul>
<li>
<p><a href="https://www.ox.security/">OX Security</a></p>
</li>
<li>
<p><a href="https://cwe.mitre.org/data/definitions/532.html">CWE-532: Insertion of Sensitive Information into Log File</a></p>
</li>
<li>
<p><a href="https://github.com/gitleaks/gitleaks">Gitleaks</a></p>
</li>
<li>
<p><a href="https://github.com/oxsecurity/MaskerLogger">MaskerLogger</a></p>
</li>
<li>
<p><a href="https://www.linkedin.com/in/tamar-galer/">Tamar Galer</a></p>
</li>
</ul>
<p>In my seven years as a software developer, I've primarily worked in teams composed solely of developers. However, my recent transition to a team of security researchers has opened my eyes to a crucial aspect that often goes unnoticed: log safety in applications.</p>
<p>My exposure to the application security ecosystem and real-life security breach analysis has opened my eyes to recognize code security issues, including the prevalence of sensitive information, tokens, passwords, and payment details, in plaintext logs. This may lead to severe data breaches, financial losses, and all kinds of catastrophes.</p>
<p>This talk will dive into the fatal mistakes developers often make that can result in the disclosure of sensitive information in logs. We'll explore the types of sensitive data in logs.</p>
<p>I'll share my personal experiences as a developer on a security research team and shed light on the often-overlooked consequences of insecure logging practices. We'll discuss practical patterns to safeguard sensitive information in Python applications, including identifying and redacting sensitive data before it reaches log files, and implementing secure logging practices.</p>
<p>By the end of this talk, developers will be equipped with the knowledge and tools to protect sensitive data from accidental disclosure and safeguard their applications from the perils of sensitive data exposure. Embrace the journey towards log safety and ensure your code remains secure and confidential.</p>
<p><img src="images/tamar-galer.jpg" alt="" /></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/iAnoCDQflsI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>FastAPI</title>
    <summary type="html"><![CDATA[A series about web application development using Python FastAPI.]]></summary>
    <updated>2024-11-04T12:35:01Z</updated>
    <pubDate>2024-11-04T12:35:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/fastapi" />
    <id>https://python.code-maven.com/fastapi</id>
    <content type="html"><![CDATA[<ul>
<li><a href="/fastapi-hello-world">Hello World</a></li>
<li><a href="/fastapi-dynamic-response-of-the-current-timestamp">Dynamic response of the current timestamp</a></li>
</ul>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>FastAPI - dynamic response of the current timestamp</title>
    <summary type="html"><![CDATA[The second step in using FastAPI is to generate some data dynamically.]]></summary>
    <updated>2024-11-04T12:30:01Z</updated>
    <pubDate>2024-11-04T12:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/fastapi-dynamic-response-of-the-current-timestamp" />
    <id>https://python.code-maven.com/fastapi-dynamic-response-of-the-current-timestamp</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/kh9d8rgo08c" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>2024-11-04-fastapi-dynamic-response-current-timestamp-python-english.mp4</p>
<p><a href="https://slides.code-maven.com/python/fastapi-dynamic-response">source</a></p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Hello World! with FastAPI</title>
    <summary type="html"><![CDATA[Getting started with web development using FastAPI.]]></summary>
    <updated>2024-11-01T13:30:01Z</updated>
    <pubDate>2024-11-01T13:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/fastapi-hello-world" />
    <id>https://python.code-maven.com/fastapi-hello-world</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/OYcdKUyX9-k" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>2024-11-01-fastapi-hello-world-python-english.mp4</p>
<p><a href="https://slides.code-maven.com/python/fastapi-hello-world">source code</a></p>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Raise Exception from None</title>
    <summary type="html"><![CDATA[Raising an exception while handling another exception can be confusing. This video helps clarify what happens.]]></summary>
    <updated>2024-10-30T15:30:01Z</updated>
    <pubDate>2024-10-30T15:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/raise-exception-from" />
    <id>https://python.code-maven.com/raise-exception-from</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/7mSGocsF_ag" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p><a href="https://slides.code-maven.com/python/raise-exception-from">source</a></p>
<p>2024-10-30-raise-exception-from.mp4</p>
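<p>A minimal sketch of the behavior discussed (the exception class and function below are hypothetical examples): <code>raise ... from None</code> suppresses the "During handling of the above exception, another exception occurred" chaining in the traceback, so only the new exception is shown:</p>

```python
class ConfigError(Exception):
    """Hypothetical application-level exception."""

def parse_port(text):
    try:
        return int(text)
    except ValueError:
        # "from None" hides the original ValueError from the traceback;
        # a plain "raise ConfigError(...)" here would show both
        # exceptions chained together.
        raise ConfigError(f"invalid port: {text!r}") from None
```

<p>Using <code>raise ... from original</code> instead would set the new exception's <code>__cause__</code> and keep both tracebacks visible.</p>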
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Testing python with pytest: The magic of fixtures</title>
    <summary type="html"><![CDATA[Fixtures prepare the test environment and clean it up at the end.]]></summary>
    <updated>2024-08-11T09:30:01Z</updated>
    <pubDate>2024-08-11T09:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/testing-python-with-pytest-the-magic-of-fixtures" />
    <id>https://python.code-maven.com/testing-python-with-pytest-the-magic-of-fixtures</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/IFfnDpbVrHQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
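<p>A minimal sketch of the fixture pattern covered in the video (the fixture and test names are illustrative): code before the <code>yield</code> is setup, code after it is teardown, and pytest injects the yielded value into any test that names the fixture as a parameter:</p>

```python
import pytest

@pytest.fixture
def database():
    db = {"connected": True}   # setup: pretend to open a connection
    yield db                   # the test function receives this value
    db["connected"] = False    # teardown: runs after the test finishes

def test_database_is_connected(database):
    assert database["connected"]
```

<p>The teardown code after <code>yield</code> runs even when the test fails, which is what makes fixtures safer than ad-hoc setup/cleanup calls inside each test.</p>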
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Getting started with Linux on Linode in the cloud for less than 1 cent</title>
    <summary type="html"><![CDATA[]]></summary>
    <updated>2024-07-09T21:30:01Z</updated>
    <pubDate>2024-07-09T21:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/getting-started-with-linux-on-linode" />
    <id>https://python.code-maven.com/getting-started-with-linux-on-linode</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/WkwpJu90cjY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<ul>
<li>I use <a href="https://www.virtualbox.org/">VirtualBox</a> to run Windows on my Ubuntu/Linux machine so I can demo how to connect to a remote Linux box from Windows. You could also use it to install Linux locally on top of your Windows machine.</li>
</ul>
<h2 class="title is-4">Terminal emulator / local shell</h2>
<ul>
<li>
<p>Install <a href="https://git-scm.com/">Git-SCM</a> and use Git Bash</p>
</li>
<li>
<p>or install <a href="https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html">PuTTY</a></p>
</li>
</ul>
<h2 class="title is-4">Linode</h2>
<ul>
<li>
<p>Register on <a href="https://linode.com">Linode</a></p>
</li>
<li>
<p>Admin interface of <a href="https://cloud.linode.com/linodes">Linode</a></p>
</li>
</ul>
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

  <entry>
    <title>Getting started with web development using Python Flask</title>
    <summary type="html"><![CDATA[The beginning of a web application.]]></summary>
    <updated>2024-07-09T12:30:01Z</updated>
    <pubDate>2024-07-09T12:30:01Z</pubDate>
    <link rel="alternate" type="text/html" href="https://python.code-maven.com/getting-started-with-web-development-using-python-flask" />
    <id>https://python.code-maven.com/getting-started-with-web-development-using-python-flask</id>
    <content type="html"><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/2d8bR5ArbvY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p><a href="https://github.com/szabgab/web-application-with-flask-2024-07-07">repository with all the links</a></p>
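<p>As a starting point of the kind shown in the video (the function name is an illustrative choice), the smallest Flask application is a single route:</p>

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # The return value becomes the HTTP response body
    return "Hello, World!"
```

<p>Save it as <code>app.py</code>, start the development server with <code>flask --app app run</code>, and visit <code>http://127.0.0.1:5000</code>.</p>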
]]></content>
    <author>
      <name>Gábor Szabó</name>
    </author>
  </entry>

</feed>

