Stories by Omar Aflak on Medium

Information & Entropy

Omar Aflak — Fri, 21 Feb 2025 18:22:32 GMT

What is information and how is it measured? What is entropy? Cross entropy? Relative entropy (aka KL Divergence)?

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/entropy

Information

Information is tied to the field of probabilities, and it can be seen as a measure of uncertainty. To avoid extrapolation and misuse of this concept, you need to remember that it only makes sense to talk about information (in the mathematical sense) when you are studying a probabilistic event.

Information relates to probabilities in that the realisation of an event with low probability brings a lot of information, and the realisation of an event with high probability brings little information.

For example: the event “It rains in London” is very likely, so it brings little information. In contrast, the event “It rains in the Sahara Desert” is so unlikely that it brings a lot more information (e.g. it could more realistically help you pinpoint the day of the event).

So information relates to probability, but how exactly?

Let’s explore the properties we would like such a mapping to have:

1. Low probability ⇒ high information (already established)
2. High probability ⇒ low information (already established)
3. Probability = 1 ⇒ Information = 0 (derived from 2: if an event is certain to be realised, then knowing about it doesn’t bring any information)
4. Probability → 0 ⇒ Information → +inf (the opposite of 3 must be true)
5. Information should be additive for independent events, i.e. learning about two independent events should give you the amount of information equal to the sum of the information gained from each event separately:

information(event1 and event2) = information(event1) + information(event2)

Given that the probability of two independent events both being realised is p(event1) * p(event2) (the product of the probabilities of each event happening individually), we must have:

information(p(event1) * p(event2)) = information(p(event1)) + information(p(event2))

I’m abusing the previous notation, because here we passed a probability into this information function instead of an event, but that’s just to illustrate the idea…

The realisation of two independent events brings as much information as the sum of the information gained from each event separately.

Lastly, if we need this mapping function to be continuous, then there’s only one family of functions that respects those properties: the logarithms.

More precisely, the negative logarithms:

-log(x)

We define information mathematically as:

I(x) = -log(p(x))

Relates the information content of an event `x` to its probability of realisation `p(x)`
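As a quick sanity check, here is a minimal Python sketch of that definition. The function name `information` and the example probabilities are my own choices for illustration; the sketch only verifies the properties listed earlier (a certain event gives zero information, rarer events give more, and independent events add up).

```python
import math


def information(p, base=2):
    """Information content of an event with probability p (0 < p <= 1).

    Computed as log(1/p), which is the same thing as -log(p).
    """
    return math.log(1 / p, base)


# A certain event carries no information.
certain = information(1.0)

# The rarer the event, the more information it carries.
likely, unlikely = information(0.9), information(0.01)

# Independent events: probabilities multiply, information adds up.
p1, p2 = 0.5, 0.25  # hypothetical probabilities of two independent events
joint = information(p1 * p2)
separate = information(p1) + information(p2)

print(certain, likely, unlikely)
print(joint, separate)
```

The last two printed values should agree up to floating point rounding, which is exactly the additivity property.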

We said “the family of functions” earlier — indeed, logarithms of all bases respect the properties listed above. You can use any of them, the difference will be in the unit of the information:

log2(x) will give “bits”
log10(x) will give “dits”
ln(x) will give “nats”

All of those are valid ways of expressing information. In practice, we often use the base 2 logarithm.

Bit: represents the amount of information content gained with a binary choice.

Example 1

I flip a fair coin p(heads) = p(tails) = 1/2 and tell you the result. I have just given you: -log2(1/2) = log2(2) = 1 bit of information! In other words, I have given you the information content gained with 1 binary choice.

Recall that log(1/x) = -log(x).

This is what the logarithm in base 2 does: it answers the question “how many times do I have to divide x by 2 to get 1 (or less) ?”, or in other words, how many binary choices do I have to make on my input space to be left with 1 element (or less). Each of these binary choices (division) represents one bit of information.

Example 2

I have to pick one fruit between 8 different fruits (assume each is equally likely to be picked). I pick one and tell you which: I have just given you -log2(1/8) = log2(8) = 3 bits of information. In other words, I have given you the information content gained with 3 binary choices (divide 8 by two 3 times).

I could have measured information in “dits” instead. It is equally correct to say that I would have given you -log10(1/8) = log10(8) ≈ 0.9 dits of information.

3 bits ≈ 0.9 dits
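If you want to convince yourself that the unit is just a matter of logarithm base, here is a small sketch using Python's standard `math` module on the fruit example (p = 1/8). The variable names are mine.

```python
import math

p = 1 / 8  # probability of picking one specific fruit out of 8

bits = -math.log2(p)   # base 2
dits = -math.log10(p)  # base 10
nats = -math.log(p)    # base e

print(f"{bits:.3f} bits = {dits:.3f} dits = {nats:.3f} nats")

# Changing unit is a multiplication by a constant:
# 1 bit = log10(2) dits = ln(2) nats
dits_from_bits = bits * math.log10(2)
nats_from_bits = bits * math.log(2)
```

Same quantity of information, three different units.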

Entropy

In the previous examples, you’ll notice that I used uniform probability distributions. This means the probability of each outcome was equally likely (p=1/2 for the coin toss, and p=1/8 for the fruit pick). Then I asked:

What is the information gained for observing one of those events?

Since the probability was the same for all events, then the answer to that question would be the same regardless of the outcome of the random experiment.

What if each outcome had a different probability of realisation?

What if I had to pick between 3 fruits, each with a different probability according to my preferences:

Mango (p=0.7)
Apples (p=0.2)
Orange (p=0.1)

A natural question is: on average, what is the information gained for observing an event from that random experiment? We are asking the same question as before, but of course since each outcome has a different probability and since the information depends on the probability, the result will change for different outcomes. Therefore we ask about the average outcome.

One way is to sum the information gained by each event weighted by the probability of realisation.

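p(mango) * I(mango) + p(apples) * I(apples) + p(orange) * I(orange) = 0.7 * (-log2(0.7)) + 0.2 * (-log2(0.2)) + 0.1 * (-log2(0.1)) ≈ 1.16 bits

Expected information gained for observing an event from the random experiment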

That is exactly what entropy is!

We call entropy the expected amount of information gained for observing an event from a random variable. In other words, this answers the question: “If I sample an event from a variable X, what is, on average, the information gained for observing one of x1, x2, ..., or xn given the probability distribution of those events?”.

We usually denote the entropy of a random variable X as H(X):

H(X) = -Σ p(xi) * log(p(xi)), summed over i = 1, ..., n

Entropy
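The weighted sum described above is short enough to write down directly. This is a minimal sketch in Python (base 2, so the result is in bits); the function name `entropy` and the dictionary layout are my own choices, while the probabilities are the ones from the fruit example.

```python
import math


def entropy(probs):
    """Expected information, in bits, of a discrete distribution.

    Outcomes with probability 0 are skipped: they never happen,
    so they contribute nothing to the average.
    """
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


fruits = {"mango": 0.7, "apples": 0.2, "orange": 0.1}
h_fruits = entropy(fruits.values())

print(f"H(fruits) = {h_fruits:.3f} bits")
```

On average, learning which fruit I picked gives you a bit more than 1 bit of information.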

An interesting follow up question is: when is the entropy minimal/maximal ?

Without even doing any calculation, we can intuitively try to answer. Give it a thought!

Minimal Entropy

Since entropy is the expected information to be gained from observing a random variable, and since information is minimal when events are certain to be realised, the absolute minimum would be reached if a random variable could be predictable every time, i.e. if it had an event with probability p=1 and the rest p=0 (in which case H(X)=0). Any other probability distribution would yield some amount of information.

Maximal Entropy

Entropy is maximised if the average information is maximal. We know information is highest for most improbable events (p->0). So if we have multiple events, each with a certain probability, and we want those probabilities to be as low as possible, then the lowest we can go on average is when we spread the probability space over all events equally, that is p=1/n with n the number of events, or in other words: a uniform probability distribution.

You can see the uniform distribution as the most “unpredictable” — the one for which each event brings the maximum amount of information content.
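To check the intuition numerically, the sketch below compares three distributions over 3 events: a certain one, the skewed fruit distribution from earlier, and the uniform one. The `entropy` helper is redefined here so the snippet runs on its own; the names are mine.

```python
import math


def entropy(probs):
    """Entropy in bits; zero-probability outcomes are skipped."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


certain = [1.0, 0.0, 0.0]    # fully predictable
skewed = [0.7, 0.2, 0.1]     # the fruit example
uniform = [1 / 3, 1 / 3, 1 / 3]  # most unpredictable

for name, dist in [("certain", certain), ("skewed", skewed), ("uniform", uniform)]:
    print(f"{name:8s} H = {entropy(dist):.3f} bits")
```

The uniform case gives log2(3) bits, which is the maximum you can get with 3 events.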

I highly advise checking out 3B1B video on how to solve the game Wordle using the concept of entropy.

https://medium.com/media/ed79509a999d67edf4330eabc3b850c9/href

There’s also another way to interpret entropy, and it’s going to be useful for the rest of the article, so before going further with Cross Entropy and Relative Entropy, we’re making a little stop at encoders.

Encoders

An encoder is a machine/routine/code that assigns to each event of a probability distribution a code (let’s say in bits, but we could use another base).

An encoder is optimal if, on average, it uses the theoretical minimum number of bits possible to represent an event drawn from the distribution.

Example 1

Say we have three events A,B,C, with p(A)=p(B)=p(C)=1/3.

We could create a coding (a mapping) that uses 2 bits to encode each outcome:

A->00, B->01, C->10

If I then give you a list of bits, e.g. 011000, you are able to decode it (by splitting every 2 bits and using the mapping above): 011000 → BCA. This works out fine, but we are wasting the 11 state of our 2 bits, which accounts for 25% of all possible states! This is far from optimal.

What if we assigned fewer bits to some events?

Example 2

Consider the following encoder:

A->0, B->10, C->11

Here, we use a total of 5 bits to encode 3 states (instead of 6 bits in the previous coding), that is 5/3 ≈ 1.67 bits per event on average, which is less than the 2 bits used previously.

With this new encoder, suppose we read the first 2 bits of a message b1,b2:

if b1 == 0 then A
if b1 == 1 and b2 == 0 then B
if b1 == 1 and b2 == 1 then C

And we can keep reading and decoding a long string of bits that way.
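Here is how that decoding loop could look in Python. It is only a sketch: the names `CODE`, `encode` and `decode` are mine, and the approach (accumulate bits until they match a code word) relies on the fact that no code word of this encoder is the beginning of another one.

```python
CODE = {"A": "0", "B": "10", "C": "11"}


def encode(message, code=CODE):
    """Concatenate the code word of every symbol."""
    return "".join(code[symbol] for symbol in message)


def decode(bits, code=CODE):
    """Read bits one by one and emit a symbol as soon as a code word matches."""
    reverse = {word: symbol for symbol, word in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            symbols.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(symbols)


bits = encode("BCA")
print(bits, "->", decode(bits))
```

Note that there is no separator between code words: the decoder always knows where one symbol ends and the next begins.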

You might ask, why not use even fewer bits? Well, let’s see.

Example 3 ❌

Consider this final encoder:

A->0, B->1, C->00

This uses fewer bits than the previous one too, but it is also ambiguous!

The bit string 00 could be either AA or C, and there’s no way around this.

An important feature of an encoder is that it has to be unambiguous, i.e. decodable in a single way.
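One way to see the ambiguity is to list every possible way of splitting a bit string into code words. The recursive helper below is a small illustrative sketch (the name `decodings` is mine); an unambiguous encoder should never yield more than one result.

```python
def decodings(bits, code):
    """Return every way `bits` can be split into code words of `code`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results


ambiguous = {"A": "0", "B": "1", "C": "00"}     # Example 3
unambiguous = {"A": "0", "B": "10", "C": "11"}  # Example 2

print(decodings("00", ambiguous))         # more than one reading
print(decodings("10110", unambiguous))    # a single reading
```

With the encoder of Example 3, the same bits can be read in two ways, so the receiver cannot know what was sent.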

Encoders & Entropy

How does that relate to entropy?

Think about the optimal encoder: that will be the encoder that assigns, on average, the fewest bits possible to an event of your distribution.

In Example 2 above, we considered A,B,C to be equally likely; but what if C was more probable than A and B? Wouldn’t it be better then to assign the single bit to C and two bits to A and B?

In general, to achieve optimality, we need to assign fewer bits to more probable outcomes and more bits to less probable outcomes.

A natural question is then: what is the minimum number of bits we can use, on average, to encode events drawn from a given probability distribution?

The answer is… the entropy!

Shannon's source coding theorem - Wikipedia

To clarify: the entropy is the theoretical minimum, but in practice you may not come up with an encoder that uses `entropy` number of bits on average.
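To make this concrete, the sketch below compares the average code length of an encoder with the entropy of the distribution, in two cases. The first is Example 2 with equally likely events. The second is a hypothetical distribution I made up where C is more probable (p=0.5) and gets the single bit. Function and variable names are mine.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def average_length(dist, code):
    """Average number of bits per event when encoding `dist` with `code`."""
    return sum(p * len(code[symbol]) for symbol, p in dist.items())


# Case 1: equally likely events, encoder of Example 2.
uniform = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
code_uniform = {"A": "0", "B": "10", "C": "11"}
len_uniform = average_length(uniform, code_uniform)
h_uniform = entropy(uniform.values())

# Case 2 (hypothetical): C is more likely, so C gets the single bit.
skewed = {"A": 0.25, "B": 0.25, "C": 0.5}
code_skewed = {"C": "0", "A": "10", "B": "11"}
len_skewed = average_length(skewed, code_skewed)
h_skewed = entropy(skewed.values())

print(f"uniform: {len_uniform:.3f} bits/event, entropy {h_uniform:.3f}")
print(f"skewed:  {len_skewed:.3f} bits/event, entropy {h_skewed:.3f}")
```

In the first case the encoder stays slightly above the entropy; in the second case it reaches it exactly, because every probability is a power of 1/2.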

Now that we’re equipped with this new insight, let’s tackle the next concepts!

Cross Entropy

Let’s say I have a machine that produces random letters (a-z) according to a certain unknown probability distribution P = [p(a), p(b), …, p(z)].

Your task is to create an encoder that is as efficient as possible for data coming from this distribution, i.e. an encoder that uses, on average, the least amount of bits possible to encode events from this distribution.

We know from earlier that the optimal encoder uses, on average, a number of bits equal to the entropy of the distribution, H(P). But for this you need to know the exact distribution, and here you don’t!

Therefore, you will have to guess what the true distribution is and produce an encoder based on your guess. Let’s call your guessed distribution Q = [q(a), q(b), ..., q(z)]. The average number of bits used by your encoder, built for Q, will be greater than or equal to H(P); the actual amount is called the cross entropy between P and Q.

The cross entropy between P and Q is the average number of bits needed to encode events from P using an optimal encoder for Q. We denote that number H(P, Q).

Said differently, it means that you were expecting data from a probability distribution Q, but in reality the data belonged to a probability distribution P. And the average amount of bits used to encode those events from P (while expecting they were drawn from Q) is what we call the cross entropy.

Can you guess the formula?

H(P, Q) = -Σ p(x) * log(q(x)), summed over all events x

Cross Entropy between P and Q

This looks very much like H(Q), but the information is weighted by the probabilities coming from P. This makes sense:

You will be using Q to encode events coming from the machine, therefore the information content will be calculated using q(x). However, the actual weighting of the information from each event comes from P since that is the true frequency of the events.

Notice that H(P, Q) ≠ H(Q, P). The result changes depending on which is the true distribution and which is the guessed distribution.

Also, notice that if you had guessed P perfectly well (Q=P), then the result should be the theoretical minimum number of bits possible to encode events from P, that is the entropy:

H(P, P) = -Σ p(x) * log(p(x)) = H(P)

Cross Entropy with a distribution and itself is the Entropy
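A short Python sketch of the cross entropy, in bits. The distributions are illustrative: P is the fruit distribution used earlier, and Q is a uniform guess. The sketch assumes q(x) > 0 wherever p(x) > 0, otherwise the logarithm is undefined. Names are mine.

```python
import math


def entropy(p):
    """Entropy of P in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Average bits needed to encode events from P with a code built for Q.

    Assumes q[i] > 0 whenever p[i] > 0.
    """
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.7, 0.2, 0.1]        # true distribution
Q = [1 / 3, 1 / 3, 1 / 3]  # guessed distribution

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
h_qp = cross_entropy(Q, P)
h_pp = cross_entropy(P, P)

print(f"H(P)    = {h_p:.3f}")
print(f"H(P, Q) = {h_pq:.3f}")
print(f"H(Q, P) = {h_qp:.3f}")
print(f"H(P, P) = {h_pp:.3f}")
```

You should see that H(P, Q) is above H(P), that swapping the arguments changes the result, and that H(P, P) falls back to H(P).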

Relative Entropy

Lastly, the relative entropy, also known as the KL divergence.

If you’ve understood the cross entropy, then this should be a piece of cake!

Recall that the cross entropy is the average number of bits used if you encode events drawn from a distribution P while expecting the events to come from a distribution Q. We said this number must be greater than or equal to H(P) since that would be the number of bits used by a perfect encoder for P.

The number of extra bits used relative to H(P) is what we call the relative entropy and we denote it KL(P||Q)! That is, not the entire entropy but just the extra you used due to the error in guessing P.

Essentially, that is the difference in bits used by a suboptimal encoder and an optimal encoder. So this is simply H(P, Q) - H(P).

KL(P||Q) = H(P, Q) - H(P) = Σ p(x) * log(p(x) / q(x)), summed over all events x

Relative Entropy, aka KL Divergence

Like the cross entropy, the relative entropy is not commutative: KL(P||Q) ≠ KL(Q||P). You can understand it as a measure of relative difference between two probability distributions, the minimum being 0 when Q=P.
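The same illustrative distributions can be used to compute the relative entropy. This is a sketch under the same assumption as before (q(x) > 0 wherever p(x) > 0); the function name `kl_divergence` is mine.

```python
import math


def kl_divergence(p, q):
    """Extra bits paid for encoding events from P with a code built for Q.

    Assumes q[i] > 0 whenever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.7, 0.2, 0.1]        # true distribution
Q = [1 / 3, 1 / 3, 1 / 3]  # guessed distribution

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
kl_pp = kl_divergence(P, P)

print(f"KL(P||Q) = {kl_pq:.3f} bits")
print(f"KL(Q||P) = {kl_qp:.3f} bits")
print(f"KL(P||P) = {kl_pp:.3f} bits")
```

The two directions give different values, and the divergence of a distribution with itself is 0.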

Last Note

In machine learning we try to minimise the cross entropy:

H(P, Q) = H(P) + KL(P||Q)

Cross Entropy

Where P is the distribution of the data, and Q is the distribution of the model. Since the data doesn’t change during the training —H(P) is a constant— we are essentially minimising the relative entropy, i.e. the difference between P and Q.

Interestingly, in the context of LLMs (Large Language Models), when we minimise the cross entropy and therefore minimise the relative entropy, the loss we end up with after training is an approximation (as KL goes to 0) of the entropy of the data distribution, that is, the entropy of language.

Information & Entropy was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

Explaining the blockchain to Paloma

Omar Aflak — Mon, 19 Sep 2022 17:34:51 GMT

Dear Paloma, this is the story of the blockchain… in a paragraph of 4 pages.

Bitcoin

The story starts in 2008, in the wake of the subprime crisis. A guy, a girl, or a group of people going by the pseudonym Satoshi Nakamoto started thinking about a way to have digital money that does not depend on a central authority to be trusted (referring here to the banks and governments, as the subprime crisis was a succession of bad human decisions). Satoshi came up with a protocol and wrote a paper titled:

Bitcoin: A Peer-to-Peer Electronic Cash System

https://www.ussc.gov/sites/default/files/pdf/training/annual-national-training-seminar/2018/Emerging_Tech_Bitcoin_Crypto.pdf

To understand the beauty of the protocol, you need to understand why it’s a challenge in the first place.

Digital money means giving value to something on a computer. But, you have used a computer, Paloma, and you know that any information can be easily duplicated or modified. If a file on your laptop contained the amount of dollars you possess, then nothing could prevent you from opening that file and changing that number: but digital money is… digital, so it has to be stored somewhere on a computer, and therefore we should have a way to prevent people from tampering with that information.

The mysterious protocol

In a nutshell (and simplified), this is what Satoshi described in the paper.

The idea is to connect hundreds of thousands, even millions, of computers together in a giant network, and have them all keep a full history of all the transactions ever made. If person A wants to send x Bitcoins to person B, then that information goes over the whole network and every computer is made aware that A sent x Bitcoins to B.

The identities of A and B are kept anonymous: all users in the network are represented by an identifier, but nothing can link an identifier to a person in the real world.

Since all computers know that A sent x Bitcoins to B, and since they hold all previous transactions ever made, they can look at all the transactions involving A for instance, to know exactly how much this account should possess now (sum of everything that identifier ever received minus sum of everything that identifier ever sent).
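For the curious, here is that bookkeeping rule as a tiny Python sketch. It is a toy model of the simplified description above, not how Bitcoin actually stores data: the `(sender, receiver, amount)` tuples, the identifiers and the amounts are all made up for illustration.

```python
def balance(account, history):
    """Everything the account ever received minus everything it ever sent."""
    received = sum(amount for _, receiver, amount in history if receiver == account)
    sent = sum(amount for sender, _, amount in history if sender == account)
    return received - sent


# Hypothetical transaction history: (sender, receiver, amount)
history = [
    ("X", "A", 10),
    ("A", "B", 4),
    ("B", "A", 1),
]

print("A owns", balance("A", history))
print("B owns", balance("B", history))
```

Nobody stores "A owns 7" anywhere: the number is recomputed from the shared history, which is why that history has to be trustworthy.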

Why does that solve the problem?

Imagine that a person whose computer is part of the network decides to cheat and changes their own history to include a transaction stating they received Bitcoins from someone. All other computers in the network (millions) will have a different transaction history. If 1 computer disagrees with 1,000,000 others, you can safely assume that the 1 has tampered with its data.

This is basically how it works. The information that is present in majority in the network is the one considered as the true information.

In other words, to change some information in the network, you would need 51% of the network to agree to tamper with their history in the same way (this is a real thing, actually called the “51% attack”), but this becomes very unlikely as the number of computers in the network grows.

Blockchain

You know that network we’ve been talking about? This is the blockchain, Paloma!

The reason it’s called that way, is because the transactions are not stored one by one in the history, they’re grouped into chunks (then some cryptographic stuff is applied, let’s not go there) and added to the history: essentially forming a chain of blocks, hence the name. The term “blockchain” was not even mentioned in the original paper that Satoshi wrote, but it was later attributed to the whole protocol he described.

The blockchain is born from Bitcoin.

Amazing. Let’s keep going.

Decentralised Apps

Now, if you think about it, Bitcoin is just one type of information that lives on the blockchain (it’s just a number representing an amount of currency). But really, you could write anything you want in that history: for instance how much you like me :)

The blockchain is essentially a public book anyone can write information to, but once written it can never be deleted, it’s there forever.

You can write new information though, so if you decide that you don’t like me anymore (highly unlikely), you can write it to the blockchain, but it won’t overwrite what you first wrote: history will show that you once liked me…

Why would you write anything other than bitcoin to the blockchain? Well, you can imagine all sorts of applications, for example to hold a patent on one of your creations. Instead of going to a patent organisation and paying a lot for them to keep a record and act as a figure of authority (a source of truth), you can instead write this information to the blockchain. Because nobody can tamper with the information over there, you can prove by looking at the date you sent your data that it was indeed you who thought of your creation first. In that sense, the blockchain acts as a source of truth, and it doesn’t require any central authority to take care of it: it regulates itself.

These sorts of applications are called “decentralised applications” or “dApps” (decentralised refers to the fact that the information is not stored in one computer, but many).

Smart contracts

After Bitcoin, many other cryptocurrencies created their own blockchains. One very popular is the Ethereum blockchain. Ethereum paved the way for something amazing which is now the true power of the blockchain. They allowed for people to send pieces of code to the blockchain: these are called smart contracts, and they are able to (if you program them to) transfer ether (the Ethereum currency) from one account to another, many other things you can think of, and all the ones you cannot :)

Once the code is on the blockchain, you can interact with it and it will always get executed in the same way. So you could imagine a smart contract for a universal basic income for example. If the government programmed such a contract to withdraw funds from their own ether account and give that ether to any account who asks the contract for it (under the constraint that they do it once a month) and publish it to the Ethereum blockchain, then that’s it, you have it.

Why would you publish this on the blockchain as a smart contract rather than having it in any other way? Great question Paloma.

Imagine someone in the government decides to spookily stop this universal income or withdraw more than once a month from the state’s wallet: it’s impossible (given the contract was programmed correctly). Once the code is on the blockchain, it’s there forever and it’s set to be executed as it was planned to from day 1, and nothing can change that. Of course if you want some flexibility for future changes, you can code it in the contract; but the code is public so anyone can see what the contract is set to do in advance.

DAO

DAO, since you asked, stands for “decentralised autonomous organisation” (a bit of a superfluous term in my opinion) and it’s supposed to be an organisation that has some parts of its business logic decentralised (via smart contracts for instance), so it’s not entirely controlled by the people of the organisation.

Conclusion

Paloma, I hope this helped you in some way or another. Hit me up in the comments if you have any questions.

I hope to see you in October 🌹

Solving Non-Linear Differential Equations numerically using the Finite Difference Method

Omar Aflak — Mon, 01 Feb 2021 19:21:00 GMT

Solve Non-Linear Differential Equations numerically using the Finite Difference Method

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/finite-difference-method

In this article we will see how to use the finite difference method to solve non-linear differential equations numerically. We will practice on the pendulum equation, taking air resistance into account, and solve it in Python.

We will find the differential equation of the pendulum starting from scratch, and then solve it. Before we start, we need a little background on Polar coordinates.

Polar Coordinates

You already know the famous Cartesian coordinates, which are probably the most used in everyday life. However, in some cases, describing the position of an object in Cartesian coordinates isn’t practical. For instance, when an object is in a circular movement, sine and cosine functions are going to pop all over the place, so it’s generally a much better idea to describe that object’s position in what we call Polar coordinates.

Polar coordinates are described by two variables, the radius ρ and the angle θ. We attach unit vectors to each variable:

eρ is a unit vector always pointing in the same direction as vector OM.
eθ is a unit vector perpendicular to eρ.

Our goal now is to express the position, velocity, and acceleration of an object in Polar coordinates. For this we need to express the relationship between the Polar unit vectors and the Cartesian unit vectors.

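eρ = cos(θ) ex + sin(θ) ey
eθ = -sin(θ) ex + cos(θ) ey

Cartesian to Polar (the Polar unit vectors expressed with the Cartesian unit vectors ex and ey)

ex = cos(θ) eρ - sin(θ) eθ
ey = sin(θ) eρ + cos(θ) eθ

Polar to Cartesian (the Cartesian unit vectors expressed with the Polar unit vectors)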

All right! Let’s express position, velocity, and acceleration in Polar coordinates.

Position

This one is quite easy. It’s the whole point of using Polar coordinates!

OM = ρ eρ

Position in Polar coordinates

Velocity

We simply differentiate the position with respect to time. We will assume ρ is a constant, and only θ varies over time.

v = d(OM)/dt = ρ θ’ eθ (since ρ is constant and d(eρ)/dt = θ’ eθ)

Velocity in Polar coordinates

Acceleration

We differentiate the velocity with respect to time.

a = dv/dt = ρ θ’’ eθ - ρ θ’² eρ (since d(eθ)/dt = -θ’ eρ)

Acceleration in Polar coordinates

All right! We can now work on our problem: the pendulum.

Pendulum Equation

In order to find the equation that the angle θ satisfies, we will use Newton’s second law of motion, or as we call it in French, the fundamental principle of dynamics.

Σ F = m a

Newton’s second law of motion

The sum of all the forces applied to a system is equal to its mass times its acceleration. Let’s enumerate all the forces applied to the pendulum and express them in Polar coordinates.

Weight

The weight of the object due to gravity is one of the forces applied to the object. Its formula is well known, and is expressed in our coordinate system (with θ measured from the downward vertical) as:

P = m g cos(θ) eρ - m g sin(θ) eθ

Weight

Where m (kg) is the mass of the object, and g (m/s²) is the value of the acceleration of gravity, which is about 9.81 on Earth.

Rope Tension

The rope exerts a tension pulling the pendulum in the direction of the rope.

T = -R eρ

Rope tension

Where R (N) is the rope tension in Newtons.

Air Resistance

Of course, the air exerts a friction on the pendulum, which will make it stop oscillating at some point. Small air resistance is usually modeled as a force opposite to the velocity vector and proportional to the norm of the velocity vector.

F = -k v = -k L θ’ eθ

Air resistance

Where k (kg/s) is the friction coefficient that is specific to the object in movement, and L (m) is the length of the pendulum rope.

Newton’s second law of motion

We can now apply Newton’s second law of motion:

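m (L θ’’ eθ - L θ’² eρ) = m g cos(θ) eρ - m g sin(θ) eθ - R eρ - k L θ’ eθ

Newton’s second law of motion applied to the pendulum (with ρ = L, the length of the rope)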

Then project the result on both axes:

Along eρ: -m L θ’² = m g cos(θ) - R
Along eθ: m L θ’’ = -m g sin(θ) - k L θ’

Projection

Reordering the terms of the second equation, we get:

θ’’ + (k/m) θ’ + (g/L) sin(θ) = 0

Pendulum equation

Solving this second-order non-linear differential equation analytically is very complicated. This is where the Finite Difference Method comes in very handy. It will boil down to two lines of Python! Let’s see how.

Finite Difference Method

The method consists of approximating derivatives numerically using a rate of change with a very small step size.

f’(x) = lim (h → 0) [f(x + h) - f(x)] / h

Derivative: rate of change

That is the very definition of what a derivative is. Numerically, if we knew f, we could take a small number h (e.g. 0.0001) and compute the above formula for a given x, which would give us an approximation of f’(x).
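As a minimal illustration of that idea, the sketch below approximates a derivative with a small step h. The helper name `derivative` and the test functions are my own choices.

```python
import math


def derivative(f, x, h=1e-4):
    """Approximate f'(x) with the rate of change over a small step h."""
    return (f(x + h) - f(x)) / h


# The derivative of sin at 0 is cos(0) = 1.
approx_sin = derivative(math.sin, 0.0)

# The derivative of x^2 at 3 is 2 * 3 = 6.
approx_square = derivative(lambda x: x ** 2, 3.0)

print(approx_sin, approx_square)
```

The results are not exact, but they get closer to the true derivative as h gets smaller (until floating point rounding takes over).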

The finite difference method simply uses that fact to transform differential equations into ordinary equations.

In our case, we start by expressing θ’’ with respect to θ’ using the rate of change.

θ’’(t) = [θ’(t + dt) - θ’(t)] / dt

I removed the limit, and wrote dt for us to know that this is supposed to be an infinitesimal value (in practice, just a very small number). We will now plug this equation into the pendulum equation and solve for θ’(t + dt).

θ’(t + dt) = θ’(t) - dt * [(k/m) θ’(t) + (g/L) sin(θ(t))]

Okay! We managed to express the angular velocity at time t+dt with respect to the angle and angular velocity at time t. In other words, if for instance dt=0.001 and if you know θ(0) and θ’(0) (which are the initial conditions of the system), then you can compute θ’(0.001)! If we could also compute θ(0.001) then the recursion is complete and we can compute [θ(t), θ’(t)] for any t starting with known initial conditions.

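Fortunately, there is a way to compute θ(t+dt), by applying the same rate of change to θ’:

θ’(t) = [θ(t + dt) - θ(t)] / dt ⇒ θ(t + dt) = θ(t) + θ’(t) * dt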

All right! With that equation in hand we can also compute the angle at time t+dt given the angle and the angular velocity at time t.

Using these two equations we can now compute the angle theta at any time step!

Indeed, given [θ(0), θ’(0)] you can compute [θ(dt), θ’(dt)]. Given [θ(dt), θ’(dt)] you can compute [θ(2dt), θ’(2dt)], and so on.

Let’s put all this to use in a Python program!

Python Code

https://medium.com/media/ad29c9483ede60c16266729600b9e16e/href

We iteratively compute θ(t) and θ’(t) using the formulas we found, and put the results in two separate lists. Hopefully the code is understandable, but feel free to drop a comment if you have any question.
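In case the embedded code does not load, here is a minimal sketch of the same scheme in plain Python. It is not necessarily identical to the embedded code: it assumes the pendulum equation has the damped form θ’’ = -(k/m) θ’ - (g/L) sin(θ), and the parameter values (m, k, L, the initial angle, dt) are arbitrary choices of mine. Plotting is left out so that it runs with the standard library only.

```python
import math


def simulate_pendulum(theta0, theta_dot0, t_end, dt, g=9.81, L=1.0, m=1.0, k=0.5):
    """Step the pendulum forward in time with the two update formulas.

    Returns two lists: the angles and the angular velocities at every step.
    The default values of g, L, m and k are illustrative assumptions.
    """
    theta, theta_dot = theta0, theta_dot0
    thetas, theta_dots = [theta], [theta_dot]
    steps = int(round(t_end / dt))
    for _ in range(steps):
        # Angular acceleration at time t, from the pendulum equation.
        theta_ddot = -(k / m) * theta_dot - (g / L) * math.sin(theta)
        # theta(t + dt) = theta(t) + theta'(t) * dt
        theta += theta_dot * dt
        # theta'(t + dt) = theta'(t) + theta''(t) * dt
        theta_dot += theta_ddot * dt
        thetas.append(theta)
        theta_dots.append(theta_dot)
    return thetas, theta_dots


# Released from 60 degrees with no initial velocity, simulated for 10 seconds.
thetas, theta_dots = simulate_pendulum(math.pi / 3, 0.0, t_end=10.0, dt=0.001)
print(f"final angle: {thetas[-1]:.4f} rad")
```

The two lists can then be passed to any plotting library to draw θ(t) and θ’(t) against time.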

Running the code will produce the following plot.

Two happy observations:

The angular velocity seems to reach extrema when the angle is zero, which makes sense since this is where the pendulum has accumulated all its inertia and is about to slow down because it’s going up.
The angular velocity seems to reach zero when the angle reaches an extremum, which makes sense since this is when the pendulum is slowing down and is about to go in the other direction.

Playing with the code a little, you might want to set the initial velocity to 2π for instance.

Notice how the angle keeps increasing before going down. What happened is that the initial velocity was high enough to make the pendulum make a full spin before entering the usual oscillation!

One last thing… You can try to increase dt and see how this affects the simulation. Hopefully you’ve understood that a smaller dt means more accurate results. Let’s see what happens for N=3,2,1 points per second (that is, dt=1/N).

I was actually surprised to see that for only 3 points per second (and even 2), we still manage to get the general shape of the solution. N=1 is another story…

Conclusion

In this article we have seen how to use the finite difference method to solve differential equations (even non-linear) and we applied it to a practical example: the pendulum. This technique also works for partial differential equations; a well-known case is the heat equation.

Solving Non-Linear Differential Equations numerically using the Finite Difference Method was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ray Tracing From Scratch in Python

Omar Aflak — Sun, 26 Jul 2020 15:50:17 GMT

Create a computer-generated image using the Ray Tracing algorithm coded from scratch in Python.

fig. 1 — computer-generated image

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/ray-tracing

In this post I will give you a glimpse of what computer graphics algorithms may look like. I will explain the ray tracing algorithm and show a simple implementation in Python.

By the end of this article you’ll be able to make a program that will generate the above image, without making use of any fancy graphic library! Only NumPy. Isn’t it crazy?! Let’s dive in!

P.S. This article is by no means a complete guide / explanation of ray tracing, since this is such a vast subject, but rather an introduction for curious people :)

Prerequisites

We only need very basic vector geometry.

If you have 2 points A and B (whatever the dimensionality: 1, 2, 3, …, n), then a vector that goes from A to B can be found by computing B - A (element-wise);
The length of a vector (whatever the dimensionality) can be found by computing the square root of the sum of the squared components. The length of a vector v is denoted ||v||;
A unit-vector is a vector of length 1: ||v|| = 1;
Given a vector, another vector that points in the same direction but with a length of 1 can be found by dividing each component of the first vector by its length (this is called normalization): u = v / ||v||;
Dot product for vectors. Specifically: v · v = ||v||²;
Solving a quadratic equation;
A bit of patience and imagination;
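If you want to refresh these operations, here they are as a few small Python functions. This is just a sketch with plain lists so that it runs without any dependency (NumPy does all of this more conveniently); the function names and the example points are mine.

```python
import math


def subtract(b, a):
    """Vector going from point a to point b (element-wise b - a)."""
    return [bi - ai for ai, bi in zip(a, b)]


def dot(u, v):
    """Dot product of two vectors of the same dimension."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Vector with the same direction as v but with length 1."""
    length = norm(v)
    return [component / length for component in v]


A = [1, 2, 3]
B = [4, 6, 3]
AB = subtract(B, A)
unit = normalize(AB)

print(AB, norm(AB), unit)
```

Notice that `norm` is written with `dot`, which is exactly the identity v · v = ||v||².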

Ray Tracing Algorithm

In effect, ray tracing is a rendering technique that simulates the path of light and intersections with objects and is able to produce images with a high degree of realism. More optimized variations of this algorithm are actually used in video games!

To explain the algorithm we need to set up a scene:

We need a 3D space (that simply means we’re going to use 3 coordinates to position objects in space);
We need objects in that space (since we’re going to reproduce fig. 1, imagine spheres);
We need a source of light (this is going to be a single point emitting light in all directions, so in essence a single position);
We need an “eye” or a camera to observe the scene (again, simply a position);
Since the camera could be looking anywhere really, we need a screen through which the camera will be observing the objects (4 positions for the four corners of a rectangular screen);

fig. 2

A word about the screen: the screen is going to occupy a certain amount of space that you will define (could be a 3x2 rectangle for instance). But 3 and 2 don’t really mean anything alone. They do mean something when you compare them to the sizes of other objects, they are relative. What’s important here is how you will split that rectangle into smaller squares (pixels), as in the figure above. This is going to determine the size of the final image. In other words, you can create a 3x2 rectangle and split it into 300x200 pixels, and that will work just fine.
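To make the splitting concrete, here is a small sketch that maps a grid of pixels onto a rectangle. It assumes the rectangle is centered on the origin and that each pixel is represented by its center; both are choices I made for the illustration, and the function name `pixel_positions` is hypothetical.

```python
def pixel_positions(rect_width, rect_height, pixels_x, pixels_y):
    """Return, row by row, the (x, y) center of every pixel of the rectangle.

    The rectangle is assumed to be centered on the origin, with the first
    row at the top and the first column on the left.
    """
    step_x = rect_width / pixels_x
    step_y = rect_height / pixels_y
    rows = []
    for i in range(pixels_y):
        y = rect_height / 2 - (i + 0.5) * step_y
        row = []
        for j in range(pixels_x):
            x = -rect_width / 2 + (j + 0.5) * step_x
            row.append((x, y))
        rows.append(row)
    return rows


# A 3x2 rectangle split into 300x200 pixels.
grid = pixel_positions(3, 2, 300, 200)
print(len(grid), "rows of", len(grid[0]), "pixels")
print("top-left pixel center:", grid[0][0])
```

The size of the rectangle only matters relative to the rest of the scene; the number of pixels is what sets the resolution of the final image.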

Given the scene, this is the ray tracing algorithm:

for each pixel p(x,y,z) of the screen:
    associate a black color to p
    if the ray (line) that starts at camera and goes towards p intersects any object of the scene then:
        calculate the intersection point to the nearest object
        if there is no object of the scene in-between the intersection point and the light then:
            calculate the color of the intersection point
            associate the color of the intersection point to p

fig. 3

Note that this process is actually the reverse of real-life illumination. In reality, light comes out of the source in all directions, bounces off objects and hits your eye. However, since not all rays coming out of the light source will end up in your eye, ray tracing does the reverse process to save computation time (it traces rays from the eye back to the light source).

This is all purely geometrical; the only thing I didn’t explain is how to calculate the color of the intersection point. This isn’t necessary right now, so I will explain it later. Just know that there exist physical models that describe how objects are illuminated when light strikes them at a certain angle, with a certain intensity, etc.

At the end of the algorithm we will have filled the screen with the correct colors, and we can just save it as an image.

Set up the scene

Before starting to code, we need to set up a scene. For now we will decide where the camera and the screen are located. For our purpose, we will make things simple by aligning them with the unit axes.

fig.4 — scene

Hence, the camera is located at the point (x=0, y=0, z=1) and the screen is part of the plane formed by the x and y axes. With that being set up, we can already write the skeleton of our code.

https://medium.com/media/7d6480628ba0b76d024ce95a68ba47e1/href

The camera is just a position, 3 coordinates;
The screen on the other hand is defined by four numbers (or two points): left, top, right, bottom. It ranges from -1 to 1 in the x direction (this is arbitrary), and ranges from -1 / ratio to 1 / ratio in the y direction, where ratio is image width / image height. The reason for this is simple: we want the screen to have the same aspect ratio as the actual image we want to produce. Setting up the screen this way will produce an aspect ratio of (width over height): 2 / (2 / ratio) = ratio, which is the ratio of the desired image of 300x200;
Finally, the loop consists of splitting the screen into width and height points in the x and y directions respectively, then computing the color of the current pixel;
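As a rough sketch of the skeleton described above (not necessarily the author's exact code), the scene setup and the pixel loop could look like this. It uses plain lists and a small hypothetical `linspace` helper instead of NumPy so that it runs with the standard library alone.

```python
def linspace(start, stop, count):
    """count evenly spaced values from start to stop inclusive (count >= 2)."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

width, height = 300, 200
camera = (0.0, 0.0, 1.0)
ratio = width / height
screen = (-1.0, 1.0 / ratio, 1.0, -1.0 / ratio)  # left, top, right, bottom

# image[i][j] is the RGB color of the pixel in row i, column j (black to start)
image = [[(0.0, 0.0, 0.0) for _ in range(width)] for _ in range(height)]

for i, y in enumerate(linspace(screen[1], screen[3], height)):
    for j, x in enumerate(linspace(screen[0], screen[2], width)):
        pixel = (x, y, 0.0)  # point on the screen, which lies in the z = 0 plane
        # the color of image[i][j] would be computed here
```

The rows go from top to bottom, so y decreases as i increases, which matches the usual image layout.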

You can actually run that code and it will produce — as expected for now — a black image. If you look back at the pseudo-code then this is what we accomplished.

✅ for each pixel p(x,y,z) of the screen:
✅    associate a black color to p
    if the ray (line) that starts at camera and goes towards p intersects any object of the scene then:
        calculate the intersection point to the nearest object
        if there is no object of the scene in-between the intersection point and the light then:
            calculate the color of the intersection point
            associate the color of the intersection point to p

Ray intersection

The next step of the algorithm is: if the ray (line) that starts at camera and goes towards p intersects any object of the scene then.

Let’s break it down into two parts. First, what is the ray (line) that starts at camera and goes towards p ?

Ray definition

We say “ray” but that’s really just another word for “line”. In general, whenever you code something that is geometrical, you should prefer vectors over actual line equations, they really are easier to work with and are much less prone to errors such as division by zero.

So, since the ray starts at the camera and goes in the direction of the currently targeted pixel, we can define a unit-vector that points in that same direction. Therefore, we define a “ray that starts at camera and goes towards pixel” with the following equation:

ray(t) = camera + t * (pixel - camera) / ||pixel - camera||

eq. 1 — ray

Remember, camera and pixel are 3D-points. For t=0 you end up at the camera position, and the more you increase t the further away you get from the camera in the direction of the pixel. This is a parametric equation, that yields a point along the line for a given t.

Of course, there is nothing special about camera or pixel, we can similarly define a ray that starts at origin (O) and goes towards destination (D) as:

ray(t) = O + t * (D - O) / ||D - O|| = O + t * d

eq. 2 — ray

For convenience, we define d as the direction vector.

We can now complete the code and add the computation of the ray.

https://medium.com/media/2820feca428fd54d50476c20a6377b58/href

We’ve added the normalize(vector) function that returns a… normalized vector;
We’ve added the computation of origin and direction which both together define a ray. Notice that pixel has z=0 since it lies on the screen which is contained in the plane formed by the x and y axes;
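To make the ray definition concrete, here is a small dependency-free sketch. The function names `ray_towards` and `point_on_ray` are hypothetical and only serve this example.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return tuple(a / norm for a in v)

def ray_towards(origin, destination):
    """Return (origin, unit direction) of the ray from origin towards destination."""
    direction = normalize(tuple(b - a for a, b in zip(origin, destination)))
    return origin, direction

def point_on_ray(origin, direction, t):
    """Point reached after travelling a distance t along the ray."""
    return tuple(o + t * d for o, d in zip(origin, direction))

camera = (0.0, 0.0, 1.0)
pixel = (0.0, 0.0, 0.0)  # z = 0 because the pixel lies on the screen
origin, direction = ray_towards(camera, pixel)
```

With t = 0 the point is the camera itself, and since the direction is a unit vector, t is directly the distance travelled from the camera.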

Now we get to the second part, which is intersects any object of the scene then. That is basically the “hard” part. The computation is going to be different for each type of object we will be dealing with (spheres, planes, triangles, etc.). For the sake of simplicity, we will only render spheres. So for the next part we will see:

How we define a sphere;
How we compute the intersection point between a ray and a sphere, if it exists;

Sphere definition

A sphere is actually a pretty simple mathematical object to define. A sphere is defined as the set of points that are all at the same distance r (radius) from a given point (center).

Therefore, given the center C of a sphere, and its radius r, an arbitrary point X lies on the sphere if and only if:

||X - C|| = r

eq. 3— sphere equation

For convenience, we square both sides to get rid of the square root caused by the magnitude of X — C.

||X - C||² = r²

eq. 4— sphere intersection

We can already define some spheres just after the screen declaration.

https://medium.com/media/0f95485eef55d3fe15cf04d60ac8b82c/href

Now let’s compute the intersection between a ray and a sphere.

Sphere Intersection

We know the ray equation, and we know what condition a point must satisfy so that it lies on a sphere. All we have to do is plug eq. 2 into eq. 4 and solve for t. This means answering the question: for which t will ray(t) be on the sphere?

||O + t * d - C||² = r²  ⇔  t² * ||d||² + 2t * <d, O - C> + ||O - C||² - r² = 0

eq. 5 — sphere intersection

This is an ordinary quadratic equation that we can solve for t. We will call the coefficients associated with t², t¹, t⁰ a, b, and c respectively. Let’s calculate the discriminant of that equation:

Δ = b² - 4ac, with a = ||d||², b = 2 * <d, O - C>, c = ||O - C||² - r²

eq. 6 — discriminant

Since d (direction) is a unit-vector, we have a=1. Once we calculate the discriminant of that equation, there are 3 possibilities:

fig. 5— sphere discriminant

We will only use the third case to detect intersections. Here’s a function that can detect intersections between a ray and a sphere. It will return t, the distance from the origin of the ray to the nearest intersection point, if the ray actually intersects the sphere, and it will return None otherwise.

https://medium.com/media/5043983dc3b29c548d71d7ab6ce52bc0/href

Notice that we return the nearest intersection (there are 2 of them) only when both t1 and t2 are positive. This is because a t that solves the equation could be negative, but that would mean the ray that intersects the sphere doesn’t have d as a direction vector, but -d (for instance if the sphere is behind the camera and the screen).

Nearest intersected object

All right, so far so good, but we still haven’t completed the instruction from the pseudo-code which was: if the ray (line) that starts at camera and goes towards p intersects any object of the scene then[...]. Good news is, we can do this and the next instruction in one strike! The next instruction is: calculate the intersection point to the nearest object.

We can easily create a function that uses sphere_intersect() to find the nearest object that a ray intersects, if it exists. We simply loop over all the spheres, search for intersections, and keep the nearest sphere.

https://medium.com/media/3a406f6386e91f15c89b9f7273e7b9e3/href

When calling the function, if nearest_object is None then there is no object intersected by the ray, otherwise its value is the nearest intersected object and we get min_distance, the distance from the ray origin to the intersection point.
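The two functions described above, sphere_intersect() and nearest_intersected_object(), can be sketched together as follows. This is a reconstruction from the text rather than the original gist: it assumes a unit direction vector (so a = 1), spheres stored as dicts with 'center' and 'radius' keys, and plain tuples instead of NumPy arrays.

```python
import math

def sphere_intersect(center, radius, ray_origin, ray_direction):
    """Distance to the nearest intersection, or None if the ray misses the sphere."""
    oc = tuple(o - c for o, c in zip(ray_origin, center))
    b = 2 * sum(d * x for d, x in zip(ray_direction, oc))
    c = sum(x * x for x in oc) - radius ** 2
    delta = b ** 2 - 4 * c  # a = 1 because ray_direction is a unit vector
    if delta > 0:
        t1 = (-b + math.sqrt(delta)) / 2
        t2 = (-b - math.sqrt(delta)) / 2
        if t1 > 0 and t2 > 0:  # both intersections are in front of the origin
            return min(t1, t2)
    return None

def nearest_intersected_object(objects, ray_origin, ray_direction):
    """Return (nearest object, distance), or (None, inf) if nothing is hit."""
    nearest_object, min_distance = None, math.inf
    for obj in objects:
        distance = sphere_intersect(obj['center'], obj['radius'], ray_origin, ray_direction)
        if distance is not None and distance < min_distance:
            nearest_object, min_distance = obj, distance
    return nearest_object, min_distance

objects = [
    {'center': (0.0, 0.0, -6.0), 'radius': 1.0},
    {'center': (0.0, 0.0, -3.0), 'radius': 1.0},
]
hit, distance = nearest_intersected_object(objects, (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

In this example the ray goes straight down the z axis from the camera, so it reaches the sphere centered at z = -3 first, at a distance of 3.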

Intersection point

In order to compute the intersection point, we use the previous function:

nearest_object, distance = nearest_intersected_object(objects, o, d)
if nearest_object:
    intersection_point = o + d * distance

Hooray! We’ve completed the second and the third instructions. This is the code we have until now:

https://medium.com/media/0ac3125dac39d11d81aafea4a764d34c/href

✅ for each pixel p(x,y,z) of the screen:
✅    associate a black color to p
✅    if the ray (line) that starts at camera and goes towards p intersects any object of the scene then:
✅        calculate the intersection point to the nearest object
        if there is no object of the scene in-between the intersection point and the light then:
            calculate the color of the intersection point
            associate the color of the intersection point to p

Light intersection

So far, we know if there is a straight line that goes from the camera/eye to an object, and we know which object this is, as well as exactly what part of the object we’re looking at. What we don’t know yet is if that specific point is illuminated at all! Maybe the light isn’t striking on that particular point, and so there is no need to go further because we cannot see it. Therefore, the next step is to check if there is no object of the scene in-between the intersection point and the light.

Fortunately, we already have a function to help us: nearest_intersected_object(). Indeed, we want to know if the ray that starts at the intersection point and goes towards the light intersects an object of the scene before reaching the light. This is practically the same task as previously, we just need to change the ray origin and direction. First, we need to define a light. You can add this near the objects declaration:

light = { 'position': np.array([5, 5, 5]) }

To check if an object is shadowing the intersection point, we have to pass the ray that starts at the intersection point and goes towards the light, and see if the nearest object returned is actually closer than the light to the intersection point (in other words, in between).

https://medium.com/media/bad83c1a583fa1313d2596ee07303f90/href

Looks neat, doesn’t it ? Well this is not going to work… We need to make a slight adjustment. If we use the intersection point as the origin of the new ray we might end up detecting the sphere where we currently stand as an object in between the intersection point and the light. A quick and widely used fix for that problem is to take a little step that gets us away from the surface of the sphere. We generally use a normal vector to the surface and take a little step in that direction.

fig. 6 — sphere normal step

This trick isn’t used only for spheres, but for any kind of object.

Therefore, the correct code is:

https://medium.com/media/818455d956ebf9bd323c1809d8fc1040/href
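A minimal sketch of the shifted shadow ray, assuming a sphere (whose normal at a surface point is simply the direction from the center to that point) and an illustrative step of 1e-5. The names `shadow_ray`, `is_shadowed` and `step` are hypothetical.

```python
import math

def subtract(u, v):
    return tuple(a - b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def normalize(v):
    n = norm(v)
    return tuple(a / n for a in v)

def shadow_ray(intersection, sphere_center, light_position, step=1e-5):
    """Origin, unit direction and length of the ray from just above the surface to the light."""
    normal_to_surface = normalize(subtract(intersection, sphere_center))
    # step slightly away from the surface so the sphere does not shadow itself
    shifted_point = tuple(p + step * n for p, n in zip(intersection, normal_to_surface))
    to_light = subtract(light_position, shifted_point)
    return shifted_point, normalize(to_light), norm(to_light)

def is_shadowed(min_distance, distance_to_light):
    """True if the nearest object along the shadow ray sits before the light."""
    return min_distance < distance_to_light

shifted_point, direction, distance_to_light = shadow_ray(
    (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 5.0))
```

The value passed as `min_distance` would be the distance returned by nearest_intersected_object() for the shadow ray; when nothing is hit it is infinite, so the point is lit.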

✅ for each pixel p(x,y,z) of the screen:
✅    associate a black color to p
✅    if the ray (line) that starts at camera and goes towards p intersects any object of the scene then:
✅        calculate the intersection point to the nearest object
✅        if there is no object of the scene in-between the intersection point and the light then:
            calculate the color of the intersection point
            associate the color of the intersection point to p

Blinn-Phong reflection model

This is it, the last part. We know a light beam has struck the object, and the reflection of the beam went straight into the camera. The question is: what does the camera see? This is what the Blinn-Phong model attempts to answer.

FYI: The Blinn-Phong model is an approximation to the Phong model that is less computationally intensive.

According to this model, any material has 4 properties:

Ambient color: color that an object is supposed to have in the absence of light. It’s hard to imagine, since we only see objects when light strikes them, but generally this is a dim color tinted with the actual color you imagine;
Diffuse color: color that is the closest to what we think of when we say “color”;
Specular color: color of the shiny part of an object when light strikes it. Most of the time this is white;
Shininess: a coefficient representing how shiny an object is;

Note: All colors are RGB representations in the range 0–1.

Phong reflection model — Wikipedia

So every object of the scene must have these 4 properties. Let’s add them to the spheres.

https://medium.com/media/5fc7b3d93d236003ad43a2e5d99e3736/href

In this example, the spheres are red, magenta, and green respectively.

The Blinn-Phong model states that light also has the three color properties: ambient, diffuse and specular. Let’s add them too.

https://medium.com/media/86034daf3ca6dc89ae8b5c7f78b52247/href

Given these properties, the Blinn-Phong model calculates the illumination of a point as follows:

eq. 7 — Blinn-Phong model

where,

ka, kd, ks are the ambient, diffuse, specular properties of the object;
ia, id, is are the ambient, diffuse, specular properties of the light;
L is a direction unit vector from the intersection point towards the light;
N is the unit normal vector to the surface of the object at the intersection point;
V is a direction unit vector from the intersection point towards the camera;
α is the shininess of the object;

https://medium.com/media/d18874b912a481783dc373e9c2963b6d/href

Notice that at the end, we bound the color between 0 and 1 to make sure it’s in the correct range.
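Here is a minimal sketch of the illumination computation, assuming the usual Blinn-Phong form: an ambient term, a diffuse term weighted by <L, N>, and a specular term weighted by <N, H> raised to a power, where H is the normalized sum of L and V. Two details are assumptions of this sketch: the exponent is taken as shininess / 4 (a common convention to bring Blinn-Phong highlights closer to Phong ones), and negative dot products are clamped to zero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(a / n for a in v)

def blinn_phong(material, light, L, N, V):
    """RGB illumination of a point; L, N and V are unit vectors."""
    H = normalize(tuple(l + v for l, v in zip(L, V)))  # halfway vector
    diffuse = max(dot(L, N), 0.0)
    specular = max(dot(N, H), 0.0) ** (material['shininess'] / 4)  # assumed exponent
    color = []
    for ch in range(3):
        value = (material['ambient'][ch] * light['ambient'][ch]
                 + material['diffuse'][ch] * light['diffuse'][ch] * diffuse
                 + material['specular'][ch] * light['specular'][ch] * specular)
        color.append(min(max(value, 0.0), 1.0))  # bound the color between 0 and 1
    return tuple(color)

red = {'ambient': (0.1, 0, 0), 'diffuse': (0.7, 0, 0), 'specular': (1, 1, 1), 'shininess': 100}
white_light = {'ambient': (1, 1, 1), 'diffuse': (1, 1, 1), 'specular': (1, 1, 1)}
facing = blinn_phong(red, white_light, (0, 0, 1), (0, 0, 1), (0, 0, 1))
grazing = blinn_phong(red, white_light, (1, 0, 0), (0, 0, 1), (0, 0, 1))
```

When the light, the normal and the camera are all aligned (`facing`), the specular highlight saturates the pixel to white. When the light grazes the surface (`grazing`), only the ambient term and a very faint highlight remain.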

✅ for each pixel p(x,y,z) of the screen:
✅    associate a black color to p
✅    if the ray (line) that starts at camera and goes towards p intersects any object of the scene then:
✅        calculate the intersection point to the nearest object
✅        if there is no object of the scene in-between the intersection point and the light then:
✅            calculate the color of the intersection point
✅            associate the color of the intersection point to p

Run the code!

Increase width and height for a higher resolution (at the cost of your time).

fig. 5 — first result

Wow that’s cool! However, you may notice 2 things that differ from the first image I’ve shown at the beginning. Go ahead, take a look back.

The grey floor is missing;
There are no reflections (mirror effect) in this picture;

Let’s address these two points.

Fake plane

Ideally we would create another type of object, a plane, but because we’re lazy we can simply use another sphere. How ? Well, if you’re standing on a sphere that has an infinitely large radius (compared to your size), then you’ll feel like you’re standing on a flat surface. Just like earth :)

Add this sphere to your list of objects, and render again!

{ 'center': np.array([0, -9000, 0]), 'radius': 9000 - 0.7, 'ambient': np.array([0.1, 0.1, 0.1]), 'diffuse': np.array([0.6, 0.6, 0.6]), 'specular': np.array([1, 1, 1]), 'shininess': 100 }

Reflection

Right now, we render rays that come out of the light source, hit the surface of an object, then directly bounce towards the camera. What if the ray hits multiple objects before hitting the camera? This is reflection. The ray will accumulate different colors and when it reaches the camera you will see reflections. Let’s do it.

Each object has a reflection coefficient in the range 0–1. “0” means the object is matte, “1” means the object is like a mirror. Let’s add a reflection property to all the spheres:

{ 'center': np.array([-0.2, 0, -1]), ..., 'reflection': 0.5 }
{ 'center': np.array([0.1, -0.3, 0]), ..., 'reflection': 0.5 }
{ 'center': np.array([-0.3, 0, 0]), ..., 'reflection': 0.5 }
{ 'center': np.array([0, -9000, 0]), ..., 'reflection': 0.5 }

Algorithm

Currently, we compute a ray that starts at the camera and goes towards a pixel, then we trace that ray into the scene, check for the nearest intersection and compute the intersection point color.

In order to include reflections, we need to trace the reflected ray after an intersection happens and include the color contribution of each intersection point. We repeat that process a number of times (to be defined).

fig. 6 — reflection

Color computation

In order to get the color of a pixel, we need to sum the contributions of each point intersected by the ray.

eq. 8 — color computation

where,

c is the (final) color of a pixel;
i is the illumination computed by the Blinn-Phong model of the #index intersection point;
r is the reflection of the #index intersected object;

Then it’s up to you to decide when to stop computing that sum (i.e. when to stop tracing reflected rays).
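One plausible reading of this sum, written as a sketch: the illumination of each successive intersection point is weighted by the product of the reflection coefficients of all the objects hit before it, so that c = i0 + r0 * i1 + r0 * r1 * i2 + ... The exact form of eq. 8 is an assumption here, and `accumulate_color` is a hypothetical helper name.

```python
def accumulate_color(illuminations, reflections):
    """Sum RGB illuminations, each weighted by the reflections met before it."""
    color = [0.0, 0.0, 0.0]
    weight = 1.0  # product of the reflection coefficients seen so far
    for illumination, reflection in zip(illuminations, reflections):
        for ch in range(3):
            color[ch] += weight * illumination[ch]
        weight *= reflection
    return tuple(color)

# three bounces, each object reflecting half of what it receives
illuminations = [(0.4, 0.0, 0.0), (0.0, 0.4, 0.0), (0.0, 0.0, 0.4)]
reflections = [0.5, 0.5, 0.5]
color = accumulate_color(illuminations, reflections)
```

Each extra bounce contributes less and less, which is why stopping after a handful of reflections usually gives a result that looks complete.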

Reflected ray

Before we’re able to code this, we need to find the reflected ray direction. We can compute a reflected ray the following way:

eq. 9 — reflection

fig. 7— reflection diagram

where,

R is the normalized reflected ray;
V is a direction unit vector of the ray to be reflected;
N is the unit vector normal to the surface the ray struck;

Add this method at the top of the file along with the normalize() function:

https://medium.com/media/cd229bb5a3b6fa4dc53785aff5198f25/href
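For reference, a sketch of such a helper using the standard mirror-reflection formula R = V - 2 * <V, N> * N, which is assumed here to be what eq. 9 states. The name `reflected` and its argument names are illustrative.

```python
def reflected(vector, axis):
    """Reflect vector about the surface whose unit normal is axis."""
    d = sum(v * a for v, a in zip(vector, axis))
    return tuple(v - 2 * d * a for v, a in zip(vector, axis))

# a ray going down and to the right bounces off a horizontal floor
bounce = reflected((1.0, -1.0, 0.0), (0.0, 1.0, 0.0))
```

The component along the normal is flipped while the components along the surface are kept, so the length of the vector is preserved.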

Code

Time to code this. It’s actually a small change at the end. Simply make the following changes:

https://medium.com/media/48dd52e2200e60e43624027b70cb280f/href

Important: Now that we have put the intersection code in another loop for reflection, we should use break statements where we previously used continue statements, in order to avoid useless computations.

That’s it! Run the code and observe the beautiful result!

Final Code

The final code is surprisingly small, about a hundred lines of code!

https://medium.com/media/0235108288c76013713fcc5cc7b897e2/href

What’s next ?

This was a very simplistic program that was meant to educate on the subject. There are so many ways to improve this and implement other fascinating functionalities. Here are some of them:

OOP! Right now we’ve put all the objects in a dict, but you could make classes, figure out what’s specific to spheres and what’s not, make a base class, and implement other objects such as planes or triangles;
Same thing goes for light. Add some OOP here and make it so you can add multiple lights in the scene;
Separate the material properties from the geometrical properties, to be able to apply one material (e.g. blue matte) to any type of objects;
Figure out a way to position the screen correctly given any camera position and a direction to look at;
Model the light differently. Currently it’s a single point, which is why the shadows of objects are “hard” or well defined. In order to get “soft” shadows (with a gradient basically), you need to model a light like a 2d or 3d object: disk or sphere?

Bonus

Here’s an animation I made with ray tracing. I simply rendered the scene several times with the camera at different positions.

https://medium.com/media/07003a129238435e06b22134c7f5d2fd/href

The code is in Kotlin (you’ll notice then how slow Python is…) and available on GitHub if you’re interested.

OmarAflak/RayTracer-Kotlin

Conclusion

Congratulations if you made it this far! I hope you enjoyed this fascinating subject, and don’t hesitate to comment if you have any questions. For further reading on the matter, I would highly recommend the following website:

Scratchapixel

Cheers !

Bézier Interpolation

Omar Aflak — Sat, 09 May 2020 08:41:17 GMT

Create smooth shapes using Bézier curves.

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/bezier-interpolation

In this article, we will see how we can use cubic Bézier curves to create a smooth line that goes through a predefined set of points. If you don’t know what Bézier curves are, you might want to check out this post I have written which could do as an introduction, or simply browse Wikipedia!

Bézier Curve

Cubic Bézier Curves

The goal is to fit n+1 given points (P0, …, Pn). In order to fit these points, we are going to use one cubic Bézier curve (4 control points) between each consecutive pair of points.

fig. 1

So in this figure, G0, G1, and G2 are three different cubic Bézier curves that start and end at (P0, P1), (P1, P2), and (P2, P3) respectively. Since any Bézier curve always starts and ends at the first and last control points, we are left with 2 control points for each curve that we will have to find so that the resulting line looks smooth.

You might want to refer back to fig. 1 during the article to understand the indices we will be using.

The general equation of the cubic Bézier curve is the following:

Where K are the 4 control points. In our case, K0 and K3 will be two consecutive points that we want to fit (e.g. P0-P1, or P1-P2, etc.), and K1 and K2 are the remaining 2 control points we have to find.

Problem Setup

Given that we have n+1 points to fit, we will use a cubic Bézier curve to fit each consecutive pair of points. We denote Γi the Bézier curve that fits Pi to Pi+1:

eq. 2

Where ai and bi are left to find. Notice that there are n curves.

If we want the final curve to be smooth, we need to ensure that the transition between Γi and Γi+1 is smooth around Pi+1. In other words, the first and second derivatives of Γi must match those of Γi+1 at Pi+1. Mathematically, this means respecting the following conditions:

eq. 3, 4

We need to find all the ai and bi. Since we have one pair of them in each Bézier curve, and since we have n curves, we need to find 2n variables. However, here we have 2(n-1) equations. We are missing 2 equations to solve the system. Therefore, we impose the following (arbitrary) boundary conditions:

eq. 5, 6

Write the system

Before solving the system, we need to calculate the first and second derivatives of Γi and write the system down.

eq. 7

and,

eq. 8

Inject the equations

Injecting eq. 7 into eq. 3:

eq. 9

Injecting eq. 8 into eq. 4:

eq. 10

Injecting eq. 8 into eq. 5:

eq. 11

Finally, injecting eq.8 into eq. 6:

eq. 12

Solve the system

To sum up, we have the following 2n equations:

eq. 9, 10, 11, 12

In order to solve the system, we are going to eliminate all the bi by injecting eq. 9 into eq. 10, 11, 12. Using eq. 9:

eq. 13

Injecting eq. 13 into eq. 10:

eq. 13, 14

Injecting eq. 13 into eq. 11:

eq. 15

Almost there! We now need to inject eq. 13 into the fourth equation, eq. 12. However, eq. 12 has bn-1 but eq. 13 holds until bn-2. Good news, we can use eq. 14 to get rid of bn-1 then inject bn-2.

eq. 16

All right! To summarise we have in this order eq. 15, 13, 16:

eq. 15, 13, 16

We can write this system as a matrix multiplication and solve it. Hold on, we’re there!

eq. 17

As you can see, the first matrix has only 3 diagonals filled with values; all other entries are zeros. This kind of matrix is called, reasonably enough, a tridiagonal matrix. Algorithms exist to solve this type of system efficiently, such as the Thomas algorithm, which runs in linear time. For the sake of simplicity, we won’t bother optimising further and we will simply use the built-in functions of NumPy in Python to solve the system.

However, we are still missing the bi points. To find these we simply make use of eq. 13 which works out all the bi up until bn-2 and then eq. 12 which gives the last term, bn-1.

eq. 13, 12

We are done at last! Let’s see how we can program this using Python.

Python Code

I did my best to make the code as clear as possible, and I added comments. If you can’t get your head around something don’t hesitate to comment!

https://medium.com/media/157329425eb5f36f2d7b4fafcb8ef274/href

Finally, this is how we would use this code:

https://medium.com/media/db83a7ff396f98ed5524baf837b12ce3/href

In my case, I got this:

fig. 2
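For readers curious about the Thomas algorithm mentioned earlier, here is a dependency-free sketch that swaps it in for the general-purpose NumPy solver. The system coefficients are re-derived from the conditions stated in the text (matching first and second derivatives at the joints, zero second derivative at both ends), so treat them as an assumption rather than a copy of eq. 17. The function works on one coordinate at a time and needs at least 3 points.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm. sub[0] and sup[-1] are unused padding entries."""
    n = len(diag)
    sup_p, rhs_p = [0.0] * n, [0.0] * n
    sup_p[0] = sup[0] / diag[0]
    rhs_p[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - sub[i] * sup_p[i - 1]
        sup_p[i] = sup[i] / denom
        rhs_p[i] = (rhs[i] - sub[i] * rhs_p[i - 1]) / denom
    x = [0.0] * n
    x[-1] = rhs_p[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = rhs_p[i] - sup_p[i] * x[i + 1]
    return x

def bezier_coefficients(values):
    """values: one coordinate of P0..Pn (n >= 2). Returns the lists (a, b)."""
    n = len(values) - 1
    diag = [2.0] + [4.0] * (n - 2) + [7.0]
    sub = [0.0] + [1.0] * (n - 2) + [2.0]
    sup = [1.0] * (n - 1) + [0.0]
    rhs = [2 * (2 * values[i] + values[i + 1]) for i in range(n)]
    rhs[0] = values[0] + 2 * values[1]
    rhs[n - 1] = 8 * values[n - 1] + values[n]
    a = solve_tridiagonal(sub, diag, sup, rhs)
    # b_i = 2 * P_(i+1) - a_(i+1) up to b_(n-2), then the last one from the boundary condition
    b = [2 * values[i + 1] - a[i + 1] for i in range(n - 1)]
    b.append((a[n - 1] + values[n]) / 2)
    return a, b

a, b = bezier_coefficients([0.0, 1.0, 2.0])
```

For 2D points, call `bezier_coefficients` once with the x coordinates and once with the y coordinates. With evenly spaced collinear values such as 0, 1, 2, the control points land at thirds of each segment, as one would expect for a straight line.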

This is it for this article! I hope you enjoyed it, and don’t hesitate to ask any questions you might have in the comments!

Bézier Curve

Omar Aflak — Sat, 02 May 2020 17:18:54 GMT

Understand the mathematics of Bézier curves

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/bezier

Bézier curves are used a lot in computer graphics, often to produce smooth curves, and yet they are a very simple tool. If you have ever used Photoshop you might have stumbled upon that tool called “Anchor” where you can put anchor points and draw some curves with them… Yep, these are Bézier curves. Or if you have used vector graphics such as SVG, these use Bézier curves too. Let’s see how it works.

Definition

Given n+1 points (P0, …, Pn) called the control points, the Bézier curve defined by these points is defined as:

eq. 1

Where B(t) is the Bernstein polynomial, and:

eq. 2

You will notice that this Bernstein polynomial looks a lot like the k(th) term in Newton’s binomial formula, which is:

eq. 3

In fact, the Bernstein polynomial is nothing but the k(th) term in the expansion of (t + (1 - t))^n = 1, which is why if you sum all the Bi up to n, you will get 1.
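The sum-to-one property is easy to check numerically. A small sketch, where `bernstein(n, k, t)` is a hypothetical helper written for this example:

```python
from math import comb

def bernstein(n, k, t):
    """k-th Bernstein polynomial of degree n evaluated at t."""
    return comb(n, k) * t ** k * (1 - t) ** (n - k)

n, t = 5, 0.3
total = sum(bernstein(n, k, t) for k in range(n + 1))  # equals 1 up to rounding
```

Because the weights are non-negative on [0, 1] and sum to 1, every point of the curve is a weighted average of the control points.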

Quadratic Bézier Curve

The quadratic Bézier curve is what we call the Bézier curve with 3 control points, since the degree of P(t) will be 2. Let’s calculate the Bézier curve given 3 control points and explore some properties we might find! Remember, eq. 1 holds for n+1 points, so in our case n=2.

eq. 4

Mind that P(t) does not return a number, but a point on the curve. Now we just have to choose three control points and evaluate the curve on the range [0, 1]. We can do this in Python quite easily.

https://medium.com/media/ef672388a1b9d2092bc13159dca826af/href

You can notice that the curve starts and ends at the first and last control points. This result will be true for any number of points. Since t ranges from 0 to 1, we can prove this by evaluating P(t) at t=0 and t=1. Using eq. 1:

eq. 4

eq. 5

Because the curve goes from P0 to P2, in this case, P1 entirely determines the shape of the curve. Moving P1 around you might notice something:

The Bézier curve is always contained in the polygon formed by the control points (more precisely, in their convex hull). This polygon is hence called the control polygon, or Bézier polygon. This property also holds for any number of control points, which makes their manipulation quite intuitive in graphics software.

Matrix representation

We can actually represent the Bézier formula using matrix multiplication, which might be useful in other contexts, for instance for splitting the Bézier curve. If we go back to our example we can rewrite P(t) as follows:

eq. 6

And so all the information about the quadratic Bézier curve is compacted into one matrix, M. Now, we might want to find the coefficients of that matrix without having to do all these steps, and in a way that is easily programmable. Since the coefficients of the matrix are simply the coefficients of the polynomial in front of each Pi, what we are looking for is the expanded form of the Bernstein polynomial eq. 2.

One more thing: if we expand Bi(t) we will get the polynomial in front of Pi, which corresponds to the i(th) column in the matrix. However, that is not really convenient and it would be easier to program if we could get rows instead. That said, you might notice that the i(th) row of the matrix is exactly the same as the reversed (n-i)(th) column, and the coefficients of the reversed (n-i)(th) column are nothing but the coefficients of B(n-i)(t) taken in decreasing powers of t.

eq. 7

You might want to refer to eq. 2 and eq. 3 if you’re having some trouble.

Therefore, the coefficients of the matrix are nothing but the coefficients in front of t, meaning:

eq. 8

https://medium.com/media/264321b5495ed7df13bf33cb76961886/href
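As an illustration of the matrix form, the sketch below builds the coefficient matrix by expanding each Bernstein polynomial. It assumes the powers of t are laid out in increasing order, T = [1, t, ..., t^n], so that P(t) = T · M · [P0, ..., Pn]; this layout and the function names are assumptions of the example.

```python
from math import comb

def bezier_matrix(n):
    """M[j][k] is the coefficient of t**j in the Bernstein polynomial B_k of degree n."""
    return [[comb(n, k) * comb(n - k, j - k) * (-1) ** (j - k) if j >= k else 0
             for k in range(n + 1)]
            for j in range(n + 1)]

def evaluate(matrix, control_points, t):
    """Point of the curve at t, computed as T · M · P."""
    dims = len(control_points[0])
    point = [0.0] * dims
    for j, row in enumerate(matrix):
        for k, coefficient in enumerate(row):
            for d in range(dims):
                point[d] += t ** j * coefficient * control_points[k][d]
    return tuple(point)

M = bezier_matrix(2)
middle = evaluate(M, [(0, 0), (1, 2), (2, 0)], 0.5)
```

For the quadratic case this gives the rows [1, 0, 0], [-2, 2, 0] and [1, -2, 1], and the point at t = 0.5 is 0.25 * P0 + 0.5 * P1 + 0.25 * P2.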

Interpolation

One interesting application of Bézier curves is to draw a smooth curve going through a predefined set of points. The reason it is interesting is because the formula of P(t) produces points and is not of the form y=f(x), so one x can have multiple y’s (basically a function that can “go backward”). For instance we could draw something like this:

However, the mathematics to produce this result are not trivial so I’ve written a dedicated post for this:

Bézier Interpolation

In the meantime, here is how you can program the general version of the Bézier curve for any number of control points using eq. 1.

https://medium.com/media/d452104de73512feaa49a618ffec6f99/href
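A compact, plotting-free sketch of the general formula (eq. 1) could look like the following. The function names and the sample count are illustrative, and points are plain tuples of any dimension.

```python
from math import comb

def bezier_point(control_points, t):
    """Point of the Bézier curve at parameter t in [0, 1]."""
    n = len(control_points) - 1
    dims = len(control_points[0])
    return tuple(
        sum(comb(n, k) * t ** k * (1 - t) ** (n - k) * p[d]
            for k, p in enumerate(control_points))
        for d in range(dims))

def bezier_curve(control_points, samples=50):
    """Evenly spaced samples of the curve, endpoints included."""
    return [bezier_point(control_points, i / (samples - 1)) for i in range(samples)]

points = [(0, 0), (1, 3), (4, 2), (5, 0)]
curve = bezier_curve(points)
```

The first and last samples coincide with the first and last control points, which is the endpoint property shown earlier.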

Run the program and you will get the graph displayed in the header.

That’s it for this introduction to Bézier curves. I hope you learned something, and don’t hesitate to ask any questions you might have in the comments!

A one night journey

Omar Aflak — Wed, 08 Apr 2020 07:58:44 GMT

What does it take ?

Photo by Kid Circus on Unsplash

I took an “entrepreneurial” course this year at my University.

I’ve heard about the top 10 techniques that every entrepreneur must know. I’ve heard about how I should talk to an audience to convince them. I’ve heard about the relentless mindset I should have, and never fear failure. I’ve heard about how I should have diversified skills among my team. I’ve heard about how important networking is. I’ve heard about the fall and the rise of dozens of people I don’t even know. I’ve heard about all the mistakes I should avoid, and everything I should do in order to succeed. Days, weeks, months have passed and I kept hearing those things.

Quarantined

A few weeks ago, one evening, I was at my friend’s house; three of us were having pizza after a skateboarding session. We turned on the TV and waited for the French president to announce new measures related to COVID-19. That’s when the quarantine was announced. We were so happy.

Hysteria

While eating our pizzas, we started to imagine apps that we could create for the weird period ahead of us. Ideas were coming from all sides, we kind of went hysterical at that moment, but eventually we settled down.

The agreement

We thought that, during this period, it would probably be hard for some students to keep up with the work, especially if exams were coming. We were mostly thinking of teenagers, but really this can be true for anyone. That’s when we agreed on making a website that would connect students to anyone willing to spend a bit of their time teaching a specific subject.

Most important part of all

We wanted to call this website covaide. You get the play on words (aide means aid, obviously) right? So, first thing we did? Buy a domain name! Of course. covaide.fr was born. Well… there was no website behind it yet, but still, it was born, for 5€. Finally ready to start coding.

The right tools

We used Firebase to host and deploy our not-ready-yet ReactJS website, which took about 10 minutes, no jokes. Firebase is a free(mium) tool made by Google, which gives you access to all the basic stuff you need to get started: Hosting, Database, Authentication, Storage, etc.

The End

We spent the night coding. One of us didn’t code, but we were all so excited that he stayed up all night watching us type… Great guy. The day after, we had a prototype. It was super ugly, really. The whole site would break if you went on mobile. But it was working. We made a post online, and it was shared about 200 times on Facebook and LinkedIn. We got a lot of feedback from friends but also from complete strangers. People started to sign up on the website… Wow!

Then the government canceled all important exams for teenagers. lol.

Conclusion

I took an “entrepreneurial” course this year at my University, but truth is, that one little experience was worth a thousand words. So, the next time you’re feeling wings growing, remember that the worst case scenario is a sleepless pizza-night hackathon with your friends, 5€ for a stupid domain name, and a good story to share. Cheers!

Optimization — Descent Algorithms

Omar Aflak — Wed, 01 Apr 2020 01:25:40 GMT

In this post, we will see several basic optimization algorithms that you can use in various data science problems.

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/optimization

Many algorithms used in Machine Learning are based on basic mathematical optimization methods. Discovering these algorithms directly in the context of Machine Learning might be confusing because of all the prerequisites. Thus, I think it might be a good idea to see these algorithms free of any context in order to get a better understanding of these techniques.

Descent Algorithms

Descent algorithms are meant to minimise a given function, that’s it. Really. These algorithms proceed iteratively, meaning that they successively improve their current solution. You might think:

What if I want to find the maximum of a function?

Simply add a minus sign in front of your function, and it becomes a “min” problem!

Let’s dive in. This is our problem definition:

Our problem consists of finding a vector x* that will minimise a function f

One prerequisite you must know is that if a point is a minimum, a maximum, or a saddle point (which looks like a minimum along some directions and a maximum along others), then the gradient of the function is zero at that point.

1D case

Descent algorithms consist of building a sequence {x_k} that converges towards x* (arg min f(x)). The sequence is built the following way:

x_{k+1} = x_k + d_k

Sequence we try to build in order to get to x*

Where k is the iteration, and d_k is a vector, same size as x, called the descent vector. Then, this is what the algorithm looks like:

x = x_init
while ||gradient(f(x))|| > epsilon:
    x = x + d

That’s it! We keep doing the update until the norm of the gradient is small enough (as it should reach a zero value at some extremum).

We will see 3 different descent/direction vectors: Newton’s direction, Gradient’s direction, and Gradient + Optimal Step Size direction. First, we need to define a function that we will try to minimise during our experiments.

Rosenbrock Function

I chose the Rosenbrock function, but you may find many others, here for instance. Another good one would be Himmelblau’s function.

Rosenbrock function — Wikipedia

It has a global minimum at (x, y)=(a, a²) where f(a, a²) = 0. I will use a=1, b=100 which are commonly used values.

We will also need two other pieces of information: the gradient of that function and its hessian matrix.

Gradient of the Rosenbrock function

Hessian of the Rosenbrock function

Let’s open up a file and start a Python script. I will do this in a Google Colab, and all the code used in this post will be available here:

Google Colaboratory

Here is our first piece of code.

https://medium.com/media/edbb55a0d360b99317e39fa0ad242157/href
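The embedded snippet is not reproduced in this text, so here is a minimal sketch of what these three definitions could look like with NumPy, using a=1 and b=100 as stated above. The function names (rosenbrock, grad_rosenbrock, hessian_rosenbrock) are my own choice and may differ from the ones in the Colab; the formulas are obtained by differentiating f(x, y) = (a − x)² + b(y − x²)² by hand.

```python
import numpy as np

A, B = 1.0, 100.0  # a and b from the text (commonly used values)


def rosenbrock(x):
    """f(x, y) = (a - x)^2 + b (y - x^2)^2, with x a vector [x, y]."""
    return (A - x[0]) ** 2 + B * (x[1] - x[0] ** 2) ** 2


def grad_rosenbrock(x):
    """Gradient [df/dx, df/dy] of the Rosenbrock function."""
    return np.array([
        -2 * (A - x[0]) - 4 * B * x[0] * (x[1] - x[0] ** 2),
        2 * B * (x[1] - x[0] ** 2),
    ])


def hessian_rosenbrock(x):
    """2x2 matrix of second derivatives of the Rosenbrock function."""
    return np.array([
        [2 - 4 * B * (x[1] - 3 * x[0] ** 2), -4 * B * x[0]],
        [-4 * B * x[0], 2 * B],
    ])


if __name__ == "__main__":
    minimum = np.array([A, A ** 2])
    print(rosenbrock(minimum), grad_rosenbrock(minimum))
```

At the global minimum (a, a²) both the function value and the gradient vanish, which is a quick sanity check on the hand-derived formulas.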

From now on, I will refer to the function input vector as x, akin to the problem definition earlier. Now that we are ready, let’s see the first descent vector!

Newton’s direction

Newton’s direction is the following:

d = −H(x)⁻¹ ∇f(x)

Newton’s direction (H is the hessian matrix of f, ∇f is its gradient)

So the update is:

x_{k+1} = x_k − H(x_k)⁻¹ ∇f(x_k)

Quick note n°1

You can find this update formula by doing the 2nd order Taylor expansion of f(x + d), since the update we are performing is x_new = x + d.

f(x + d) ≈ f(x) + f’(x) d + ½ f’’(x) d²

We want to find d such that f(x + d) is as low as possible. Supposing f’’(x) is positive, this expression is a parabola in d that has a minimum. That minimum is reached when its derivative with respect to d is zero, i.e. when f’(x) + f’’(x) d = 0, which gives d = −f’(x) / f’’(x).

In n dimensions, f’’(x) becomes the hessian matrix, and 1/f’’(x) shows up as the inverse hessian matrix. Finally, f’(x) will be the gradient.

Quick note n°2

We need to compute the inverse of the hessian matrix. For big matrices, this is a very computationally intensive task. Therefore, in practice, we solve this a bit differently, but in a totally equivalent manner.

H(x) g = ∇f(x)

linear equation

Instead of computing the inverse of the hessian matrix, we solve this equation for g and make the update rule the following:

x_{k+1} = x_k − g

Let’s code the algorithm now:

https://medium.com/media/c383b586bf7aa336642114cd85d0719d/href
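As a rough sketch of the idea (not necessarily the code from the Colab), Newton’s method can be written with np.linalg.solve, which solves the linear system for g instead of inverting the hessian. The function name newton, its parameters, and the small quadratic used for the demo are assumptions made for illustration.

```python
import numpy as np


def newton(grad, hess, x_init, epsilon=1e-10, max_iterations=100):
    """Newton's method: returns (x, number of updates performed)."""
    x = np.asarray(x_init, dtype=float)
    for iteration in range(max_iterations):
        g = grad(x)
        if np.linalg.norm(g) <= epsilon:
            return x, iteration
        # Solve H(x) step = grad(x) rather than computing the inverse hessian.
        step = np.linalg.solve(hess(x), g)
        x = x - step
    return x, max_iterations


# Hypothetical demo function: f(x) = (x0 - 3)^2 + 2 (x1 + 1)^2.
def demo_grad(x):
    return np.array([2 * (x[0] - 3), 4 * (x[1] + 1)])


def demo_hess(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])


if __name__ == "__main__":
    x_star, iterations = newton(demo_grad, demo_hess, [0.0, 0.0])
    print("x* =", x_star, "iterations =", iterations)
```

On a quadratic function the second order Taylor expansion is exact, so a single Newton step lands on the minimum. On the Rosenbrock function, you would pass its gradient and hessian instead.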

You will notice a small difference with the algorithm I presented at the beginning. I added a max_iteration parameter, so that the algorithm doesn’t run indefinitely if it doesn’t converge. Let’s try it.

https://medium.com/media/978e1da5b2c877030ffd3caa79e7e21e/href

We get this result:

x* = [1. 1.]
Rosenbrock(x*) = 0.0
Grad Rosenbrock(x*) = [0. 0.]
Iterations = 2

The algorithm converged in only 2 iterations! That’s really fast. You might think:

Hey, the initial x is very close to the target x*, that makes the task easy!

You’re right. Try some other values: for instance, with x_init = [50, -30], the algorithm terminates in 5 iterations.

This algorithm is called Newton’s method, and many descent algorithms can be seen as modifications of it! It’s kind of the mother formula. The reason why it’s really fast is that it uses second order information (the hessian matrix).

Using the hessian matrix, however, comes at a cost: efficiency. Computing an inverse matrix is a computationally intensive task, so mathematicians came up with solutions to overcome this problem. Mainly: Quasi-Newton methods, and Gradient methods. Quasi-Newton methods try to approximate the inverse of the hessian matrix with various techniques, whereas Gradient methods simply stick to first order information.

Gradient’s direction

If you did some Machine Learning, you’ve probably seen this already. The gradient direction:

d = −α ∇f(x)

Gradient’s direction

Where α is called the step size (or learning rate in ML), and is a real number.

If you have been doing some Machine Learning, now you know this formula is actually part of a bigger one: it is Newton’s direction, except that we replaced the inverse hessian with a constant! The update rule now is:

x_{k+1} = x_k − α ∇f(x_k)

The algorithm becomes:

https://medium.com/media/143d976399c461e59d96d9830d658e1e/href
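For reference, a minimal sketch of gradient descent with a constant step size could look like the following. The function name, the default values, and the quadratic demo function are assumptions chosen for illustration, not the values used in the Colab.

```python
import numpy as np


def gradient_descent(grad, x_init, alpha=0.1, epsilon=1e-6, max_iterations=1000):
    """Constant-step gradient descent: returns (x, number of updates performed)."""
    x = np.asarray(x_init, dtype=float)
    for iteration in range(max_iterations):
        g = grad(x)
        if np.linalg.norm(g) <= epsilon:
            return x, iteration
        x = x - alpha * g  # move against the gradient, scaled by the step size
    return x, max_iterations


# Hypothetical demo function: f(x) = (x0 - 3)^2 + 2 (x1 + 1)^2.
def demo_grad(x):
    return np.array([2 * (x[0] - 3), 4 * (x[1] + 1)])


if __name__ == "__main__":
    x_star, iterations = gradient_descent(demo_grad, [0.0, 0.0])
    print("x* =", x_star, "iterations =", iterations)
```

Compared with Newton’s method, the only change is the direction: the solve against the hessian is replaced by a multiplication by alpha.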

Let’s try it out:

https://medium.com/media/ff4d2849606ee0560747d3ae65b5dd70/href

You can tweak the values of alpha, epsilon, and max_iterations. I came up with the values used here in order to get a result similar to Newton’s method. This is the result:

x* = [0.99440769 0.98882419]
Rosenbrock(x*) = 3.132439308613923e-05
Grad Rosenbrock(x*) = [-0.00225342 -0.00449072]
Iterations = 5000

Wow! Gradient descent took 5000 iterations where Newton’s method took only 2! Moreover, the algorithm didn’t completely reach the minimum point (1, 1).

The main reason why this algorithm converged so slowly compared to Newton’s method is that we no longer have the information given by the second derivative of f: we replaced the inverse hessian with a constant.

Think about it. The derivative of a function is the rate of change of that function. So the hessian gives information about the rate of change of the gradient. Since finding the minimum implies necessarily a zero gradient, the hessian becomes super useful as it tells you when the gradient goes up or down.

Many papers in ML are just about finding a better approach for this specific step. Momentum, Adagrad, or Adadelta are some examples.

Gradient’s direction + Optimal step size

One improvement to the classical gradient descent is to use a variable step size at each iteration, not a constant. Not only is the step size variable, it is also the best possible step size along the gradient direction.

d_k = −α_k ∇f(x_k)

α_k is the step size at iteration k

The update is:

x_{k+1} = x_k − α_k ∇f(x_k)

How do we find α? Since we want this update to be as efficient as possible, i.e. to minimise f as much as possible, we are looking for α such that:

α_k = arg min_α f(x_k − α ∇f(x_k))

Notice that at this step, x and grad(x) are constants. Therefore, we can define a new function q:

q(α) = f(x − α ∇f(x))

Where q is actually a function of one variable. And we want to find the α that minimises this function. Umm… Gradient descent? We could, but while we’re at it, let’s learn a new method: Golden Section Search.

Golden Section Search aims at finding the extremum (minimum or maximum) of a function inside a specified interval, assuming the function has a single extremum there (it is unimodal). Since we restrict α to the range [0, 1], this is the perfect opportunity to use this algorithm.

Golden Section Search — first 5 iterations

As this post is starting to be pretty long I’m not going to go into the details. Hopefully, with the help of that magnificent GIF I took ages to make, and the code below, you’ll be able to understand what’s happening here.

https://medium.com/media/eb7157eff57809fc419a0c7a4decfa90/href
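In case the embedded code is not visible, here is a minimal sketch of a golden section search for a minimum. It is a simple variant that re-evaluates the function at both interior points on every iteration; the function name, its parameters, and the tolerance are assumptions, and the function q is assumed to be unimodal on [a, b].

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def golden_section_search(q, a, b, tolerance=1e-6):
    """Approximate the minimiser of a unimodal function q on [a, b]."""
    while b - a > tolerance:
        # Two interior points that split the interval by the golden ratio.
        c = b - INV_PHI * (b - a)
        d = a + INV_PHI * (b - a)
        if q(c) < q(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


if __name__ == "__main__":
    # Hypothetical 1D function with its minimum at 0.3.
    print(golden_section_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0))
```

Each iteration shrinks the interval by the same factor (about 0.618), so the number of iterations depends only on the interval width and the tolerance.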

Now that we are able to find the best α, let’s code gradient descent with optimal step size!

https://medium.com/media/cba14462984ff35daf362f43be2a34b9/href
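A possible sketch of this variant is shown below. To keep the block self-contained it includes its own compact golden section line search over α in [0, 1]; the names best_step and gradient_descent_optimal_step, the default values, and the quadratic demo function are all assumptions made for illustration.

```python
import math

import numpy as np

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio


def best_step(q, a=0.0, b=1.0, tolerance=1e-8):
    """Golden section search for the alpha in [a, b] that minimises q(alpha)."""
    while b - a > tolerance:
        c = b - INV_PHI * (b - a)
        d = a + INV_PHI * (b - a)
        if q(c) < q(d):
            b = d
        else:
            a = c
    return (a + b) / 2


def gradient_descent_optimal_step(f, grad, x_init, epsilon=1e-6, max_iterations=1000):
    """Gradient descent where each step size minimises q(alpha) = f(x - alpha * grad)."""
    x = np.asarray(x_init, dtype=float)
    for iteration in range(max_iterations):
        g = grad(x)
        if np.linalg.norm(g) <= epsilon:
            return x, iteration
        alpha = best_step(lambda a: f(x - a * g))
        x = x - alpha * g
    return x, max_iterations


# Hypothetical demo function: f(x) = (x0 - 3)^2 + 2 (x1 + 1)^2.
def demo_f(x):
    return (x[0] - 3) ** 2 + 2 * (x[1] + 1) ** 2


def demo_grad(x):
    return np.array([2 * (x[0] - 3), 4 * (x[1] + 1)])


if __name__ == "__main__":
    print(gradient_descent_optimal_step(demo_f, demo_grad, [0.0, 0.0]))
```

Note that every iteration now costs many evaluations of f (one line search), so fewer iterations does not automatically mean less total work.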

Then, we can run this code:

https://medium.com/media/dbb893eeb41252d2aade76ae9dae1689/href

We get the following result:

x* = [0.99438271 0.98879563]
Rosenbrock(x*) = 3.155407544747055e-05
Grad Rosenbrock(x*) = [-0.01069628 -0.00027067]
Iterations = 3000

Even though in this case the results are not significantly better than pure gradient descent, generally the optimal step size performs better. For instance, I tried the same comparison with Himmelblau’s function, and gradient descent with optimal step size was more than twice as fast as pure gradient descent.

Conclusion

This is the end of this post. I hope you learned some new things that triggered your curiosity for mathematical optimization! There are tons of other interesting methods. Go find them! Don’t forget to check out the Google Colab file, you will find all the code used and the same tests we did here with Himmelblau’s function. Don’t hesitate to leave a comment, and until next time, peace! :)

Mathematics of Neural Networks: Python Code

Omar Aflak — Thu, 21 Feb 2019 17:23:54 GMT

Neural networks from scratch in Python

Photo by Mathew Schwartz on Unsplash

This article is a translation of the post originally published here.

UPDATE: Check out the updated version of this article at: https://omaraflak.com/articles/neural-network, https://omaraflak.com/articles/neural-network-2

The goal of this article is to understand how a framework such as Keras is implemented, and also to understand the mathematical foundations behind machine learning. We will therefore build, from scratch, a mini library that lets us construct neural networks very easily, as shown below:

3-layer neural network

From here on, I will assume that you already have some fundamental knowledge, such as the artificial neuron model and the gradient descent algorithm. Once again, the goal here is not to explain the various possible applications of machine learning, but rather how to implement these algorithms.

Layer by Layer

Let’s keep in mind the overall machine learning workflow:

1. Feed an input to the model.
2. Propagate this input through the neural network until we get the output.
3. Once we have the output, compare it with the desired output and compute an error.
4. Adjust the model’s parameters to reduce that error. To do so, subtract from each parameter the derivative of the error with respect to that parameter, scaled by a learning rate (gradient descent).
5. Go back to step 1.

The most important step is the 4th. We want to be able to create as many layers as we like, of any type, and to use any activation function. However, changing the architecture of the neural network also changes the closed-form expression of the derivative of the error with respect to the parameters.

The goal is therefore an implementation that is independent of the model’s architecture (as in Keras). To achieve this, we must implement each layer separately.

What each layer must do

Whatever layer we are coding (fully connected, convolutional, maxpooling, dropout, etc.), there will always be at least two fundamental elements: an input and an output.

Forward pass (forward propagation)

We can already state an important property: the output of one layer is the input of the next layer.

This part is what is called the forward pass (forward propagation): we propagate the input X (image, sound, text, etc.) through the neural network until we obtain the output Y. Then we observe an error E, which we now need to reduce.

Gradient Descent

This is a quick reminder; if you need to learn more about gradient descent, there are tons of resources on the Internet.

Basically, we want to change some parameter in the network (call it w) so that the total error E decreases. There is a clever way to do it (rather than changing the parameter at random), which is the following:

w ← w − α ∂E/∂w

Where α is a parameter in the range [0,1] that we set and that is called the learning rate. Anyway, the important thing here is ∂E/∂w (the derivative of E with respect to w). We need to be able to find the value of this expression for any parameter of the network, regardless of its architecture.

Backward pass (backward propagation)

Suppose we give a layer the derivative of the error with respect to its output (∂E/∂Y); then it must be able to provide the derivative of the error with respect to its input (∂E/∂X).

The error E is a scalar (a number), and X and Y are matrices. The notation above (an abuse of notation) means the following:

∂E/∂X = [∂E/∂x_1  ∂E/∂x_2  …  ∂E/∂x_i]
∂E/∂Y = [∂E/∂y_1  ∂E/∂y_2  …  ∂E/∂y_j]

Let’s set ∂E/∂X aside for now and focus on ∂E/∂Y. If a layer has access to ∂E/∂Y, where Y is its own output, then we can very easily compute the derivative of the error with respect to its parameters, ∂E/∂W (for the adjustment step), and this regardless of the overall architecture of the neural network! We just use the chain rule:

∂E/∂w = Σ_j (∂E/∂y_j) (∂y_j/∂w)

Given ∂E/∂Y, we can therefore compute ∂E/∂W and adjust the layer’s parameters!

Why do we need ∂E/∂X?

Remember, the output of one layer is the input of the next layer. So ∂E/∂X for one layer is ∂E/∂Y for the previous layer! Once it has its own ∂E/∂Y, the previous layer can in turn adjust its parameters. To compute ∂E/∂X we use the chain rule once again:

∂E/∂x_i = Σ_j (∂E/∂y_j) (∂y_j/∂x_i)

This trick is the key to understanding backward propagation! After that, we will be able to code a convolutional neural network in no time.

A great diagram

This is what I described earlier. Layer 3 updates its parameters using ∂E/∂Y, then passes ∂E/∂H2 to the previous layer, for which it plays the role of “∂E/∂Y”. Layer 2 then does the same, and so on.

This may seem abstract for now, but it will become very clear later on. We can now create our first Python class, an abstract class representing a layer.

Abstract class: Layer

The abstract class Layer, which the other layers will inherit from, holds the characteristics common to all layers: an input, an output, a function for the forward pass, and one for the backward pass.

https://medium.com/media/a1d792797eafb36d7e3c488d2389c266/href
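As an illustration, such a base class could be sketched as follows. The method name backward_propagation and the learning_rate parameter come from the text; forward_propagation and the attribute names are assumptions chosen by symmetry.

```python
class Layer:
    """Base class: every layer has an input, an output, and two passes."""

    def __init__(self):
        self.input = None
        self.output = None

    def forward_propagation(self, input_data):
        """Compute the output Y of the layer for a given input X."""
        raise NotImplementedError

    def backward_propagation(self, output_error, learning_rate):
        """Given dE/dY, update any parameters and return dE/dX."""
        raise NotImplementedError
```

Raising NotImplementedError is one simple way to mark the methods as abstract; the abc module would be another option.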

As you can see, there is an extra parameter in backward_propagation that I did not mention: learning_rate. This parameter should really be an update policy, or an optimizer as it is called in Keras, but for the sake of simplicity we will just pass the learning rate and update our parameters using gradient descent.

Fully Connected Layer

Let’s now define and implement the first type of layer: the Fully Connected Layer, or FC layer. FC layers are the most basic layers, since every input neuron is connected to every output neuron.

Forward Propagation

The value of each output neuron can be computed as follows:

y_j = b_j + Σ_i x_i w_ij

Matrix notation simplifies this computation:

Y = XW + B

We are done with the forward pass. Let’s now handle the backward pass of the FC layer.

Note that I am not using any activation function; that’s because we will implement it in a separate layer!

Backward Propagation

As we said, suppose we have a matrix containing the derivative of the error with respect to this layer’s output (∂E/∂Y). We need:

The derivative of the error with respect to the parameters (∂E/∂W, ∂E/∂B)
The derivative of the error with respect to the input (∂E/∂X)

Let’s start with ∂E/∂W. This matrix must have the same size as W: i × j, where i is the number of input neurons and j the number of output neurons. We need one derivative per parameter:

∂E/∂W = [∂E/∂w_ij], with one entry for each weight w_ij

Using the chain rule, we write:

∂E/∂w_ij = Σ_k (∂E/∂y_k) (∂y_k/∂w_ij) = (∂E/∂y_j) x_i

Hence,

∂E/∂W = Xᵀ ∂E/∂Y

We have our first formula, which lets us adjust the weights! Let’s now compute ∂E/∂B.

Once again, ∂E/∂B must have the same dimension as B: one derivative per bias.

∂E/∂b_j = Σ_k (∂E/∂y_k) (∂y_k/∂b_j) = ∂E/∂y_j

Hence,

∂E/∂B = ∂E/∂Y

Now that we have ∂E/∂W and ∂E/∂B, we can adjust all the parameters of this layer so as to reduce the error! All that remains is to compute ∂E/∂X so that the previous layer can do the same calculations.

Once again, the chain rule:

∂E/∂x_i = Σ_j (∂E/∂y_j) (∂y_j/∂x_i) = Σ_j (∂E/∂y_j) w_ij

We can write the whole matrix:

∂E/∂X = ∂E/∂Y Wᵀ

And that’s it! We have obtained the three fundamental formulas for the FC layer:

∂E/∂X = ∂E/∂Y Wᵀ
∂E/∂W = Xᵀ ∂E/∂Y
∂E/∂B = ∂E/∂Y

Coding the FC layer

Create a new file containing the following code:

https://medium.com/media/e0e90a001853dbd5dca31a5dd1af4995/href
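A minimal sketch of such a layer, assuming the input X is a row vector of shape (1, i) and the forward rule Y = XW + B, might look like this. The class name FCLayer, the random initialisation in [-0.5, 0.5), and the optional rng argument are assumptions added for illustration; the backward pass uses the standard results dE/dX = dE/dY · Wᵀ, dE/dW = Xᵀ · dE/dY and dE/dB = dE/dY.

```python
import numpy as np


class FCLayer:
    """Fully connected layer: Y = XW + B, with X of shape (1, input_size)."""

    def __init__(self, input_size, output_size, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Small random initial parameters (an arbitrary choice for this sketch).
        self.weights = rng.random((input_size, output_size)) - 0.5
        self.bias = rng.random((1, output_size)) - 0.5
        self.input = None
        self.output = None

    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.input @ self.weights + self.bias
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # dE/dX must be computed with the weights from before the update.
        input_error = output_error @ self.weights.T
        weights_error = self.input.T @ output_error
        # Gradient descent step on the parameters (dE/dB equals dE/dY).
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error
```

The order matters in backward_propagation: the input error is computed first, so that it reflects the weights that actually produced the output.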

Activation layer

All the computations we have done so far were completely linear. A stack of linear layers is still a linear model, so it cannot learn non-linear relationships. We need to add non-linearity to the model by applying non-linear functions to the output of some layers.

We now have to redo the whole process for this new type of layer!

https://medium.com/media/a32341bb891e1a18b9b74288b4c9bb60/href

No worries, it will be much faster, since there are no trainable parameters. We only need to compute ∂E/∂X.

We will call f and f' the activation function and its derivative, respectively.

Forward Propagation

As you will see, it is quite simple. For a given input X, the output is simply the activation function applied to each element of X, i.e. Y = f(X) element by element. This means that the input and the output have the same dimension.

Backward Propagation

Given ∂E/∂Y, we want to compute ∂E/∂X.

∂E/∂x_i = (∂E/∂y_i) f'(x_i), that is, ∂E/∂X = ∂E/∂Y ⊙ f'(X)

Careful: here we are using an element-wise multiplication (⊙) between the two matrices, not a matrix product.

Coding the activation layer

The code for the activation layer is also very simple.

https://medium.com/media/b85efa76afbd9c312afa1e44ec204d97/href

You can also write some activation functions and their derivatives in a separate file. They will be used later to create an activation layer.

https://medium.com/media/f15a23c30b9c0dd1f5da7e168dfa4097/href
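To make this concrete, here is a minimal sketch of an activation layer together with one activation function (tanh) and its derivative. The class name ActivationLayer and the constructor signature are assumptions; the backward pass is the element-wise product of dE/dY with f'(X).

```python
import numpy as np


def tanh(x):
    return np.tanh(x)


def tanh_prime(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    return 1 - np.tanh(x) ** 2


class ActivationLayer:
    """Applies an activation function element by element; no trainable parameters."""

    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime
        self.input = None
        self.output = None

    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # learning_rate is unused: there is nothing to update in this layer.
        return self.activation_prime(self.input) * output_error
```

The learning_rate argument is kept only so that every layer shares the same backward_propagation signature.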

Loss function

Until now, for a given layer, we assumed that ∂E/∂Y was given (by the next layer). But what about the last layer? How does it get ∂E/∂Y? We simply provide it manually, and it depends on how we define the error.

The network’s error, which measures how well the model performs for a given input, is defined by us. There are many ways to define the error, and one of the best known is called MSE (Mean Squared Error).

Mean Squared Error

E = (1/n) Σ_i (y*_i − y_i)²

Where y* and y denote the desired output and the actual output, respectively. You can think of the loss as a final layer that takes all the output neurons and squashes them into a single neuron. What we need now, as for every other layer, is to define ∂E/∂Y. Except that now, we have finally “reached” E!

∂E/∂y_i = (2/n) (y_i − y*_i), that is, ∂E/∂Y = (2/n) (Y − Y*)

And that’s it! We just give this value to the last layer during the backward pass, which lets it adjust its parameters; it then computes ∂E/∂X and passes it to the layer before it, which repeats the same process, and so on.

We can implement the loss function in a separate file. It will be used when creating the network.

https://medium.com/media/8faa28e1cf1d7b073b7011bf20a4038a/href
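For instance, the mean squared error and its derivative with respect to the network output could be sketched as below. The function names mse and mse_prime and the argument order (desired output first) are assumptions.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between the desired and the actual output."""
    return np.mean(np.power(y_true - y_pred, 2))


def mse_prime(y_true, y_pred):
    """Derivative of the MSE with respect to y_pred: (2/n) (y_pred - y_true)."""
    return 2 * (y_pred - y_true) / y_true.size
```

mse returns a scalar, while mse_prime returns an array with the same shape as the output, which is exactly the ∂E/∂Y that the last layer expects.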

Network class

Almost done, hang in there! Now that all our building blocks are ready to use, we will write a class called Network that lets us build these famous neural networks!

I commented almost every part of the code, so it should not be too hard to understand if you followed the previous steps. Still, leave a comment if you have any questions, I will be happy to answer!

https://medium.com/media/0edee416f27e38da88987d223f6ad69d/href
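As a rough idea of what such a class can look like, here is a minimal sketch. It only assumes that each layer exposes forward_propagation(input) and backward_propagation(output_error, learning_rate); the method names add, use, predict and fit, as well as the returned error history, are assumptions made for this illustration.

```python
class Network:
    """A sequence of layers trained sample by sample with gradient descent."""

    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    def add(self, layer):
        self.layers.append(layer)

    def use(self, loss, loss_prime):
        # Loss function and its derivative with respect to the network output.
        self.loss = loss
        self.loss_prime = loss_prime

    def predict(self, input_data):
        results = []
        for sample in input_data:
            output = sample
            for layer in self.layers:
                output = layer.forward_propagation(output)
            results.append(output)
        return results

    def fit(self, x_train, y_train, epochs, learning_rate):
        history = []  # average error per epoch
        for _ in range(epochs):
            total_error = 0.0
            for x, y in zip(x_train, y_train):
                # Forward pass through every layer.
                output = x
                for layer in self.layers:
                    output = layer.forward_propagation(output)
                total_error += self.loss(y, output)
                # Backward pass: each layer turns dE/dY into dE/dX.
                error = self.loss_prime(y, output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)
            history.append(total_error / len(x_train))
        return history
```

The network never needs to know what kind of layers it contains: it only chains the forward passes, then feeds the derivative of the loss back through the layers in reverse order.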

Building a neural network

At last! We can use our class to create a neural network with as many layers as we want! We are going to build two neural networks: a simple XOR and an MNIST solver.

Solving XOR

Starting with XOR is always a good idea, because it is a simple way to check whether the network is learning anything.

https://medium.com/media/e2cd57305cf937f6c415d310afb2228c/href

I don’t think I need to emphasize many things. Just be careful with the training data: the sample dimension should always come first. For example, for the XOR problem, the shape of the input data should be (4,1,2).
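To illustrate the shape convention, the XOR data can be laid out as follows with NumPy. The variable names x_train and y_train are assumptions; each sample is a (1, 2) row vector and each target a (1, 1) array.

```python
import numpy as np

# 4 samples, each of shape (1, 2): the sample dimension comes first.
x_train = np.array([[[0, 0]], [[0, 1]], [[1, 0]], [[1, 1]]])
# 4 targets, each of shape (1, 1).
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

print(x_train.shape)  # (4, 1, 2)
print(y_train.shape)  # (4, 1, 1)
```

Iterating over x_train then yields one (1, 2) row vector per sample, which is the shape a fully connected layer with 2 inputs expects.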

Result

$ python xor.py 
epoch 1/1000 error=0.322980
epoch 2/1000 error=0.311174
epoch 3/1000 error=0.307195
...
epoch 998/1000 error=0.000243
epoch 999/1000 error=0.000242
epoch 1000/1000 error=0.000242

[
    array([[ 0.00077435]]),
    array([[ 0.97760742]]),
    array([[ 0.97847793]]),
    array([[-0.00131305]])
]

Clearly it works! We can now solve something more interesting: the MNIST dataset.

Solving MNIST

https://medium.com/media/5c680bc3b49d6ae9926284d8e49709c6/href

Result

$ python example_mnist_fc.py
epoch 1/30   error=0.238658
epoch 2/30   error=0.093187
epoch 3/30   error=0.073039
...
epoch 28/30   error=0.011636
epoch 29/30   error=0.011306
epoch 30/30   error=0.010901

predicted values : 
[
    array([[ 0.119,  0.084 , -0.081,  0.084, -0.068, 0.011,  0.057,  0.976, -0.042, -0.0462]]),
    array([[ 0.071,  0.211,  0.501 ,  0.058, -0.020, 0.175,  0.057 ,  0.037,  0.020,  0.107]]),
    array([[ 1.197e-01,  8.794e-01, -4.410e-04, 4.407e-02, -4.213e-02,  5.300e-02, 5.581e-02,  8.255e-02, -1.182e-01, 9.888e-02]])
]
true values : 
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

It works perfectly! Amazing :)

https://medium.com/media/71cc1dd5f467e8e68938f7dffa85b7f3/href

GitHub

You can find the complete code used for this post in the following GitHub repo. It also contains the code for other layers, such as convolutional and Flatten layers.

GitHub - omaraflak/Medium-Python-Neural-Network: This code is part of my post on Medium.

https://medium.com/media/7ee24b0358ff07ee6f03e97e6133fe5e/href

If you liked this article, a few claps 👏 would help me a lot. Peace! 😎

Mathématiques des réseaux de neurones — code Python was originally published in France School of AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

A road towards happiness

Omar Aflak — Tue, 01 Jan 2019 12:26:20 GMT

If you’re reading this at the time it is published, then happy new year to you! 🎉🥂 I thought I would publish something a bit different from the usual stuff, so let’s talk about happiness :)

Last year, I stumbled upon an old Instagram story from Will Smith where he talks about happiness. He said that your happiness is your responsibility. I found this idea interesting and I started thinking about it.

Photo by Rob Schreckhise on Unsplash

Your Happiness

Regardless of the time or place you live in, regardless of your gender, regardless of your age, your religion, and beliefs, no matter who you are, the end goal of every action you ever take is your happiness.

As humans living in the 21st century, we get happy or fulfilled through two major aspects of our lives. Snapchat and Instagram. Just kidding… Personal life (family, friends, partners…) and professional life (work).

Unfortunately, few people are fulfilled by both aspects at once, and the truth is you really need both of them to be happy. Since we spend most of our time at work, it seems reasonable to assume that we need to be fulfilled by our professional life; having a good personal life is not enough. In this post I’d like to focus on the work aspect.

Why do some people feel fulfilled by their work?

Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. — Steve Jobs

Passion gives us a sense of purpose, a mission, a reason to never settle and to always improve ourselves. Passion unlocks our potential; it gives us strength and makes us proud of who we are becoming, satisfied, because we are actually moving forward.

Passion comes within the right environment

While some people can have predispositions to be better at something and hence like it, passion can only be brought to life within the right environment: what surrounds us in our everyday life. As we grow up, as children or students, our environment is essentially made up of family, friends, and schools. For now let’s focus on family.

Why is family so important in the process of finding your passion?

I like to think of it the way Simon Sinek thinks of employees in a company.

If we trust each other we will turn our backs, we will take risks, we will innovate, we will do things that will change the course of our world. If I don’t trust you, I can’t do that. — Simon Sinek

In this talk, he says that leaders should expand the “circle of trust” of their companies from the most senior employees to the most junior ones, so that everyone feels like they belong to the company and uses their time and energy to actually innovate and be productive, not to protect themselves.

It is essentially the same thing in families.

I had the chance to grow up in a family where, for example (and of course it doesn’t simply boil down to that), “my money is your money, and your money is my money”. I realized later, looking at friends around me, that this isn’t always the case (for them it was something more like “my money is my money, and your money is your money”).

While I had the time to keep practicing and learning the things I love (programming), some of my friends had to figure out a way to “repay” their parents for money they’ve spent. They had to find irrelevant jobs to get money while I was investing time in my own future, practicing.

Today, because of all the time I’ve spent learning and improving myself in the areas I like, I can easily find internships when I need to, and I have a lot of experience in programming. But all this wouldn’t have been possible, if the environment I grew up in wasn’t suited for that. Of course it is not simply about money, my parents taught me to always be eager to discover new things, and supported me no matter what.

A side note about money…

Money isn’t important, you can always manage to get some. You spend it today, you’ll make more tomorrow. Whereas if you spend your time, you’ll never get it back; and this time is even more valuable when you’re young, don’t waste it.

So if you’re a parent and you want the best thing for your kids, teach them to enjoy learning and allow them to use their time. The more time they have, the more they can discover new fields, the more likely they are to find something they love.

Passion comes within the right environment.

What about schools?

As I said earlier, as we grow up, the two major components of our environment are family and schools.

Since, as children (or students), we spend most of our time in schools (just like work when we grow up), we could easily argue that they are at least as important as families in the process of education. That’s why we refer to schools as education! Because schools are not simply supposed to give knowledge, they’re supposed to educate!

Therefore, and this is actually why I’ve been writing this whole post, I would like you to think about this: