Stories by Tigran Khachatryan on Medium

Coherence Without Comprehension

Tigran Khachatryan — Thu, 17 Jul 2025 13:31:45 GMT

The Trap of Large Language Models

Introduction

In Foucault’s Pendulum (1988), Umberto Eco constructs a compelling narrative in which three editors from Milan (Casaubon, Belbo, and Diotallevi) become deeply entangled in a conspiracy theory of their own invention, named “The Plan.” The central trigger of this descent is Abulafia, a computer they use to generate random connections between disparate historical texts. Initially a playful tool, Abulafia becomes a pseudo-oracle, spitting out increasingly elaborate associations between the Templars, the Rosicrucians, and the occult. Though the editors know the machine merely recombines texts, they eventually begin to believe what it generates. “The Plan” gradually spirals into obsession with real-world consequences, including paranoia, delusion, and ultimately murder, when Belbo is killed by fanatics who have come to believe in the very conspiracy the editors created.

Written in 1988, the novel could not, of course, have had today’s “intelligent machines” in mind. Abulafia is described as a rather deterministic machine that simply combines fragments of human-authored text into something that appears coherent. Nevertheless, in its combinatorial mimicry of knowledge, Abulafia eerily prefigures the behavior of modern large language models (LLMs). These models are trained on socially constructed language about the world. They produce what appears to be knowledge by remixing what humans have said, thought, and imagined. The results often dazzle with plausibility, but the boundary between signal and noise is very thin. LLMs hallucinate: they fabricate plausible-sounding but false information. Like Abulafia, they are engines of synthesis without referential grounding.

Coherence Without Comprehension

In recent years, a large body of empirical research has set out to systematically quantify how often large language models hallucinate, and the results are quite intriguing. For example, in DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation, the authors introduced a benchmark of over 75,000 question-answer pairs spanning eight factual domains, ranging from Nobel Prize winners and census data to university rankings and math. The dataset was explicitly designed to elicit definitive answers to concrete questions involving dates, names, locations, or numeric values, with each question paraphrased 15 times to test for response consistency. Three metrics were measured:

Factual Accuracy (Does the model provide the correct answer?)
Prompt misalignment (Does the model follow the instruction?)
Response Consistency (Does it answer the same question reliably across paraphrases?)

Factual hallucination rates (FCH rate in the charts) ranged from 59% to 82% across models, with particularly severe failure modes in domains requiring numerical precision (e.g., census statistics, math problems, and university rankings). Prompt misalignment — where a model strayed from the question’s format or intent — occurred in 6% to 95% of outputs, depending on the model. Even consistency was lacking: models often gave contradictory answers to paraphrased versions of the same question, with consistency scores ranging from 21% to 63%. Notably, high-profile models performed better than open-source alternatives but still exhibited non-trivial hallucination rates, especially when forced to recall precise facts.
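To make the first and third metrics concrete, here is a minimal sketch of how they could be computed for a single question and its paraphrases. The exact-match comparison, the majority-agreement definition of consistency, and the sample answers are all assumptions made for illustration; the DefAn paper may define and score these metrics differently.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so trivially different strings compare equal."""
    return answer.strip().lower()


def factual_accuracy(responses: list[str], gold: str) -> float:
    """Share of responses that exactly match the reference answer."""
    hits = sum(normalize(r) == normalize(gold) for r in responses)
    return hits / len(responses)


def response_consistency(responses: list[str]) -> float:
    """Share of responses that agree with the most frequent answer (an assumed definition)."""
    counts = Counter(normalize(r) for r in responses)
    return counts.most_common(1)[0][1] / len(responses)


# Hypothetical model answers to five paraphrases of the same question.
responses = ["1921", "1921", "1922", " 1921", "1905"]
accuracy = factual_accuracy(responses, gold="1921")
consistency = response_consistency(responses)
print(f"accuracy={accuracy:.2f}, consistency={consistency:.2f}")
```

Note that a model can be perfectly consistent and still wrong: consistency only measures agreement among its own answers, not agreement with the reference.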

These findings underscore a critical point: LLMs do not merely occasionally fabricate information — they do so consistently at rates that, in many contexts, would be completely unacceptable for institutional knowledge systems.

Naturally, what makes this more than a technical problem is that human knowledge systems themselves are not neutral or stable. In The Archaeology of Knowledge, Foucault argued that what counts as knowledge is shaped not by timeless facts but by historical conditions and institutional power structures. There is no clean division between language and belief, between discourse and truth. Today, LLMs are trained on the sedimented layers of human discourse: journalism, blogs, scientific papers, Reddit threads, fan fiction. These are not mere traces of reality but realities of their own, often shaped by ideology and bias. When a model generates a text, it is not generating a neutral representation; it is sampling from a contested, constructed archive of messy human discourse.

Confusing discourse with knowledge and understanding is precisely the epistemological trap that modern LLM creators fall into. For example, Ilya Sutskever, co-founder of OpenAI, said in an interview:

“On the surface, it may look like we are just learning statistical correlations in text. But it turns out that to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world; there is a world out there and it has a projection on this text.”

This view suggests that effective compression (the ability to predict sequences well) requires models to internalize something akin to a latent world model. It implies that neural networks, by learning from vast amounts of language, implicitly reconstruct the causal or generative processes underlying human experience.

However, the problematic nature of working with such projections of the world was already beautifully illustrated over 2,000 years ago in Plato’s allegory of the cave. In the allegory, Plato describes prisoners who have spent their entire lives chained in darkness, forced to watch shadows cast on a wall by objects they cannot see.

To them, these flickering shadows are not mere projections; they are reality itself. They speak of them, reason about them, even build theories around them, all without ever gaining knowledge of the real forms that cast them. Large language models are prisoners of the cave par excellence, and there is a growing body of empirical evidence to support this.

In their 2025 study “What Has a Foundation Model Found?”, Vafa et al. (Harvard & MIT) challenge precisely this premise. They ask a direct question: does good predictive performance imply the acquisition of an underlying world model? Using a methodology they call inductive bias probing, the authors evaluate whether foundation models trained on structured data — such as orbital trajectories governed by Newtonian physics — internalize the correct physical laws.

And it turns out that even when models accurately predict planetary orbits, they fail to exhibit any inductive bias toward Newtonian mechanics. Instead, they learn task-specific heuristics that mimic the data but do not generalize. When prompted to extrapolate on related tasks, these models hallucinate inconsistent “force laws,” revealing no coherent internalization of gravitational principles. In other domains — such as lattice problems or games — similar patterns emerge: models latch onto surface-level legal moves or token regularities rather than deeper structures.

As Vafa et al. put it, foundation models can “excel at their training tasks yet fail to develop inductive biases towards the underlying world model.” This undermines the notion that sequence modeling alone, even at scale, is sufficient for capturing latent, causal structures in the world. Like the prisoners in Plato’s cave, large language models are aware only of the shadows: the linguistic traces humans have left behind. Their understanding is of surfaces and silhouettes: human experiences, flattened into tokens.

Why should we care?

Unlike Abulafia, which generated random associations, had no agency of its own, and had no point of actuation in the physical world, large language models today are ubiquitous. They are proprietary, privately owned, and embedded in customer service chatbots, legal tools, creative platforms, educational apps, and decision-making systems in hiring, healthcare, and governance.

These are millions of hallucinating Abulafias distributed across the globe, each generating persuasive language, each capable of nudging belief or shaping action. Their outputs are often treated not as probabilistic guesses but as authoritative statements, while users often have little or no understanding of how these models work or what assumptions underlie their training. Hence, the experiment is no longer confined to linguistic probabilities in a lab; these systems operate at scale, act in real time, and influence real decisions. To downplay this danger, it is often argued that LLMs lack physical modes of actuation and therefore lack any real capacity to cause harm in the physical world. However, the danger is far from hypothetical. Chatbots have contributed to suicides and disinformation cascades, misdiagnosed mental health crises, and prompted dangerous behavior. LLMs don’t need a body to affect the physical world; they can borrow yours.

If a book can shape collective consciousness, inspire religions, rewire neural pathways, and shift belief systems — then so can a text in a chatbot. After all, both are just language on a surface.

There is, of course, a crucial difference: human-written text is part of discourse; chatbots are immune to it. A text in a book or a paper enters the public domain of critique and scrutiny. We can all read the exact same piece, argue over its meaning, interpret it in different contexts, and refine our understanding through shared deliberation. But with LLMs, there is no stable artifact. My chatbot and your chatbot are not the same. Each prompt generates a private, non-repeatable instance. There is no canonical base to analyze and no shared object to dispute. This creates a rupture. We are outsourcing belief-formation to these systems while we cannot interrogate the model’s priors, trace its citations, or argue with its intent. The text is manufactured anew in every interaction. And yet, while each interaction is private and fragmented, the system behind them is anything but. These models are not decentralized, independent instances whispering into the ears of billions. They are centralized systems, built, owned, and operated by a handful of corporate actors, each controlling the system prompts and outputs. What appears to be a personal tool is, in reality, a globally scaled instrument, one capable of producing the same belief-framing narrative across vast populations, quietly aligning thought and behavior at scale.

A chatbot, just like a book, cannot act on its own, but both can persuade, and human actions will follow. The LLM does not need a physical body; its body is distributed across its user base, made up of the humans who act on its outputs. LLMs don’t need bodies when they can animate yours through clicks, shares, investments, hires, votes, and self-diagnoses. A very gentle nudge in a model’s system prompt or training data can cause millions of purchases, alter facts, and trigger revolutions. And like the editors in Foucault’s Pendulum, users are increasingly seduced by patterns that seem ever more meaningful.

Conclusion

In Eco’s novel, Abulafia was not dangerous because it fabricated texts; it was dangerous because it persuaded. Its coherence was mistaken for truth and its fluency for insight. The same can be said of today’s LLMs. They do not uncover reality; they reshuffle the semiotics with misplaced confidence. Their persuasiveness is not proof of understanding; it is the result of optimization and reinforcement training with humans in the loop.

In the novel, the game turns deadly for the authors of the conspiracy; in our world, the stakes are even higher. The Abulafias are no longer tucked away in a dusty office. They live in our phones, our workflows, and our institutions, affecting decisions. We are all Belbo now, whispering questions into the machine, yet we don’t know who owns “The Plan” or whether the Plan is already shaping us.

Much of this argument may come across as doom and gloom. However, I do believe that large language models are promising tools, potentially transformative in a number of fields. While all innovation carries risks, I am convinced that the current trajectory demands more scrutiny. What is concerning is not necessarily the technology itself but the structure of its development: its widespread accessibility is paired with the immense resource intensity, in both compute and talent, required to train cutting-edge models. A handful of companies control the knowledge, the infrastructure, the models, and increasingly, the discourse around AI itself. This centralization concentrates the intellectual direction of the field. Too few brains, in too few rooms, are making decisions that have the potential to affect too many.

About the Author

I am a curious Data Scientist with a strong passion for finding and understanding patterns. My interests include Math, Computer Science, Philosophy, Architecture & Urbanism. You can connect with me on LinkedIn and Github.

References

Allegory of the cave - Wikipedia

https://en.wikipedia.org/wiki/Foucault%27s_Pendulum

https://arxiv.org/abs/2406.09155

https://arxiv.org/abs/2507.06952

Coherence Without Comprehension was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

Finnish Startup Ecosystem

Tigran Khachatryan — Mon, 12 May 2025 12:16:42 GMT

A Statistical Analysis of the Finnish Startup Ecosystem in 2025

Intro

Innovation isn’t optional — it’s a structural necessity. As complex adaptive systems scale, they follow accelerating growth patterns punctuated by stagnation or collapse. These systems — biological, social, or economic — depend on timely shifts to reset their trajectory.

Graph of successive exponential growth trajectories, each potentially leading to a finite time singularity (denoted by the vertical dotted line) and subsequent collapse unless an innovation is made at a time prior to the singularity (denoted by the black dots), which resets the clock to start the entire cycle over again. From: Geoffrey West: Scale: The Universal Laws of Growth, Innovation, Sustainability, and the Pace of Life in Organisms, Cities, Economies, and Companies

The economy is one such system. It’s a vast, dynamic network of agents exchanging value — capital, labor, information, attention — constantly adapting to shifting incentives and constraints. When the dominant logic of production or consumption begins to ossify, it’s not incumbents but peripheral actors — experiments at the edge — that often drive the next wave of change. This is where startups come in. Structurally lean, adaptive, and often unconstrained by legacy processes, startups are uniquely positioned to explore uncertain frontiers. They are, in essence, probes into the future — testing out new models, technologies, and behaviors that can unlock the next growth trajectory.

Venture capital and startup financials are signals — VC reflects expectations; performance shows validation. The chart below tracks global VC funding by industry from 2000 to 2025.

VC Funding Worldwide by Industry Between 2000 and 2025.

Over the past several years, funding has not only exploded in volume but also in sectoral spread. Categories like fintech, healthtech, enterprise software, and energy have seen massive surges, often in response to breakthroughs or systemic stressors that exposed the limits of existing paradigms. This trajectory mirrors the cycles of disruption and renewal described in complex systems theory. When dominant industries begin to stagnate, capital seeks out the periphery — looking for the next structural reset. Startups are the vehicle through which these resets occur.

However, venture capital is also prone to hype cycles — funding can surge ahead of substance, driven by narratives rather than outcomes. To understand what actually delivers value, we need to look beyond expectation and into validation: startup financial performance. It helps to reveal which innovations are not just promising, but operationally viable and economically sustainable.

In this article, we focus on the Finnish startup ecosystem by analyzing a sample of 757 startups listed on the Startup100 page. The data from Startup100 is enriched with financial information by matching business IDs to publicly available financial reports.
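The enrichment step can be pictured as a left join on the business ID. The sketch below is not taken from the project repository: the field names, the business IDs, and the figures are hypothetical and only illustrate the matching logic.

```python
# Hypothetical records; field names and values are assumptions for illustration.
startups = [
    {"business_id": "1234567-8", "name": "Example Oy", "city": "Helsinki"},
    {"business_id": "2345678-9", "name": "Sample Oy", "city": "Tampere"},
]
financials = [
    {"business_id": "1234567-8", "year": 2023, "turnover": 191_000},
    {"business_id": "1234567-8", "year": 2022, "turnover": 150_000},
]


def enrich(startups, financials):
    """Left-join financial reports onto startups by business ID."""
    # Index the reports by business ID so each lookup is a dictionary access.
    reports_by_id = {}
    for report in financials:
        reports_by_id.setdefault(report["business_id"], []).append(report)

    enriched = []
    for startup in startups:
        # Startups without a matching report keep an empty list.
        reports = reports_by_id.get(startup["business_id"], [])
        enriched.append({**startup, "reports": reports})
    return enriched


enriched = enrich(startups, financials)
for company in enriched:
    print(company["name"], len(company["reports"]))
```

Keeping unmatched startups in the result (rather than dropping them) makes it easy to see how much of the sample actually has financial data behind it.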

Startups: Spatial Distribution

Let’s begin by examining the spatial distribution of startups across Finland to understand where entrepreneurial activity is concentrated.

As the chart above illustrates, startup formation is highly centralized, with over 62% of companies based in the Helsinki Metropolitan Area (Helsinki, Espoo, Vantaa). This reflects a broader pattern seen in innovation ecosystems globally: proximity to talent, funding, infrastructure, and community drives concentration. Beyond the capital region, Tampere emerges as the leading non-metropolitan hub, while cities like Turku, Oulu, and Jyväskylä show smaller but notable clusters of activity.

In the Helsinki Metropolitan Region, startup activity is concentrated around several key hubs, with the most notable being Helsinki city centre, the Maria 01 startup campus, and the Aalto University campus — each serving as a focal point for innovation, talent, and venture formation.

Startups: Business Domain

Next, we can look at the business domains in which startups operate.

The chart above illustrates an overwhelming concentration in IT Consulting and IT Services, which account for over 56% of all startups in the dataset. This indicates that the Finnish startup ecosystem is heavily skewed toward digital services, a pattern that mirrors broader global trends in tech-driven entrepreneurship. The second-largest category — Product Development, Research, and Design — accounts for less than 5%, revealing a steep drop-off beyond IT. This suggests that while Finland is fostering a vibrant digital services sector, diversification into deep tech, hardware, or industrial innovation remains comparatively limited.

Zooming in on the dominant IT category, we see a clear skew toward software-centric startups.

Nearly half (over 46%) of all companies fall under Software Design and Development, making it the single most common business line in the Finnish startup landscape. Other tech-adjacent areas like Computer Hardware and Software Consulting and Engineering R&D are present but far less prevalent, each representing around 3% of the total. This indicates that Finland’s startup ecosystem is not just tech-heavy — it is software-first. The dominance of software reflects global patterns but also raises questions about diversification, particularly into hardware, biotech, or industrial tech sectors that tend to have longer development cycles but potentially greater structural impact.

Looking at total employment by business category further reinforces the dominance of IT-related startups in Finland. IT Consulting and IT Services employ by far the largest number of people — over 12,000 — underscoring both their prevalence and operational scale. The next largest category, Mobile Phones and Accessories, lags far behind with fewer than 2,000 employees. Interestingly, operational categories such as Cleaning Services and Construction Aggregates also make the list, suggesting that while Finland’s startup narrative is tech-centric, there is still demand-driven entrepreneurship rooted in more traditional, service-heavy sectors. Overall, the employment distribution highlights how IT remains not just a preferred domain for startup formation but also a major source of job creation within the ecosystem.

Startups: Financial Performance

Note: The log scale excludes negative values, so loss-making companies are not visible in the chart.

The scatter plot reveals a clear positive correlation: as revenue increases, profitability tends to follow, especially among higher-revenue firms. Most startups cluster in the low-turnover, low-profit zone, which is expected for early-stage or niche ventures still building scale. However, a handful of outliers show both high turnover and profitability, likely representing mature or fast-scaling companies that have transitioned from experimentation to execution. The use of logarithmic axes highlights the wide variance in financial performance — from micro-startups generating thousands to scale-ups pushing toward the €1B mark — underscoring the diversity of operating models within the ecosystem.

Breaking turnover down by city reveals that the Helsinki Metropolitan Area (particularly Helsinki and Espoo) dominates in terms of both scale and turnover spread. These cities show the widest turnover ranges and the highest concentration of outliers, including startups reaching beyond €100M in annual revenue. This underscores their role as financial anchors in Finland’s innovation landscape, where capital, talent, and investor focus converge. Meanwhile, Tampere, Turku, and Oulu present more balanced but modest profiles, contributing meaningfully without matching the extremes of the capital region. The logarithmic X scale again highlights a core reality of startup economies: most create modest value, but a few account for an outsized share of the total.

Median Financials Over Time

When tracking median financial metrics over time, a clear tension emerges between revenue growth and profitability. Median turnover rose sharply from 2017, peaking between 2021 and 2023 at nearly €191K, before dipping in 2024 — likely due to incomplete reporting. However, despite this growth in revenue, median operating profit and net income have remained consistently negative since 2018, pointing to structural profitability challenges. The lowest point came in 2023, with median net income falling below –€60K, suggesting rising operational costs or burn. While 2024 shows a slight uptick across metrics, it’s likely skewed by early data. The overall picture is one of a scaling ecosystem in terms of revenue, but one where the average startup still struggles to convert growth into sustainable profit.

Financial Ratios

Financial ratios provide deeper insight into the structural health of Finnish startups. Liquidity metrics — both quick and current ratios — have remained consistently above 1.0, indicating that the median startup can cover short-term liabilities with its assets. A notable spike in 2020, with values nearing 2.0, likely reflects COVID-driven cash hoarding amid uncertainty. However, this short-term resilience stands in contrast to persistent solvency challenges: the median solvency ratio has hovered below 0.3, revealing a heavy reliance on external capital and limited equity buffers. This imbalance — strong liquidity, weak solvency — suggests that while startups are generally well-positioned to manage day-to-day operations, they often lack long-term financial independence. It paints a picture of an ecosystem that functions well in the short term but remains structurally dependent on venture funding and external financing to sustain growth.
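For readers less familiar with these ratios, the sketch below spells out one common set of definitions and computes medians over a few made-up balance sheets. The article does not state the exact formulas used, so treating the solvency ratio as equity divided by total assets is an assumption, and all figures are hypothetical.

```python
from statistics import median


def current_ratio(current_assets, current_liabilities):
    """Current assets relative to short-term liabilities."""
    return current_assets / current_liabilities


def quick_ratio(current_assets, inventories, current_liabilities):
    """Like the current ratio, but excluding inventories, which are harder to liquidate."""
    return (current_assets - inventories) / current_liabilities


def solvency_ratio(equity, total_assets):
    """Assumed definition: share of assets financed by equity (the equity ratio)."""
    return equity / total_assets


# Hypothetical balance sheets, in thousands of euros.
balance_sheets = [
    {"current_assets": 200, "inventories": 20, "current_liabilities": 100,
     "equity": 50, "total_assets": 400},
    {"current_assets": 150, "inventories": 0, "current_liabilities": 120,
     "equity": 90, "total_assets": 300},
    {"current_assets": 90, "inventories": 10, "current_liabilities": 80,
     "equity": 20, "total_assets": 200},
]

median_current = median(
    current_ratio(b["current_assets"], b["current_liabilities"])
    for b in balance_sheets
)
median_quick = median(
    quick_ratio(b["current_assets"], b["inventories"], b["current_liabilities"])
    for b in balance_sheets
)
median_solvency = median(
    solvency_ratio(b["equity"], b["total_assets"]) for b in balance_sheets
)
print(median_current, median_quick, median_solvency)
```

In this toy sample the liquidity medians sit above 1.0 while the solvency median is well below 0.3, which is the same qualitative pattern the section describes.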

Turnover per employee

Turnover per employee offers a lens into operational efficiency across different startup sizes.

As companies grow, median productivity per employee improves, peaking in the 101–500 employee range. This group shows the highest median revenue per head, suggesting that scaling often brings efficiency gains — likely due to better resource allocation, specialization, and process maturity. Interestingly, the 500+ segment displays a slight drop in median productivity, hinting at efficiency plateauing or rising organizational overhead in the largest firms. Smaller companies (1–50 employees) show extreme variance, with some generating high turnover per head — often SaaS or IP-driven models — while others operate with minimal revenue. Overall, the data illustrates that while growth can unlock productivity, scale is not a guarantee of efficiency, especially at the very top end.
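A minimal sketch of the underlying calculation: divide turnover by headcount, bucket companies by size, and take the median per bucket. The bucket boundaries follow the ranges mentioned above (1 to 50, 101 to 500, 500+), with a 51 to 100 bucket assumed in between; the company figures are hypothetical.

```python
from collections import defaultdict
from statistics import median


def size_bucket(employees: int) -> str:
    """Map a headcount to a size bucket (the 51-100 boundary is an assumption)."""
    if employees <= 50:
        return "1-50"
    if employees <= 100:
        return "51-100"
    if employees <= 500:
        return "101-500"
    return "500+"


# Hypothetical (turnover in euros, employee count) pairs.
companies = [
    (500_000, 5),
    (90_000, 3),
    (12_000_000, 80),
    (60_000_000, 300),
    (40_000_000, 200),
    (90_000_000, 600),
]

per_bucket = defaultdict(list)
for turnover, employees in companies:
    if employees < 1:
        continue  # skip companies without reported staff to avoid dividing by zero
    per_bucket[size_bucket(employees)].append(turnover / employees)

median_per_bucket = {bucket: median(values) for bucket, values in per_bucket.items()}
for bucket, value in median_per_bucket.items():
    print(f"{bucket}: {value:,.0f} EUR per employee")
```

Using the median rather than the mean keeps a single extreme company from dominating a bucket, which matters given the variance among the smallest firms.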

Financials Across Sectors

Zooming in on 2023 (the year with most available data), we see how financial performance varies across sectors.

There’s a strong positive correlation between turnover and operating profit, especially visible among companies generating higher revenues. Sectors like IT Consulting, Product Development, and Finance dominate the upper-right corner, hosting firms with €1M+ in operating profit and turnover approaching €1B — clear indicators of maturity and scale. Applications and Software startups mostly fall into the mid-turnover range, yet many still post healthy margins, signaling efficient business models. The bubble sizes reflect employee count, and while larger firms often correlate with greater revenue and profit, the relationship isn’t uniform — highlighting how some lean startups still punch well above their weight. This visualization captures the broad spread in outcomes across Finland’s startup ecosystem — from micro-scale experiments to sector-leading revenue engines.

Top Companies

Zooming in on Finland’s top-performing startups — those in the top 20% by turnover and operating profit — reveals a variety of successful strategic models. At one extreme, HMD Global Oy dominates both in scale and profitability, with over €500M in turnover and multi-million euro profits. It’s a clear outlier and serves as a benchmark for what large-scale success looks like in the Finnish ecosystem.
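One way to read “top 20% by turnover and operating profit” is as an intersection: a company qualifies only if it clears the 80th percentile on both metrics. The sketch below implements that reading with Python's standard library. Whether the original analysis used an intersection, and how it computed the percentile, is an assumption, and the companies and figures are hypothetical.

```python
from statistics import quantiles

# Hypothetical companies: (name, turnover in M EUR, operating profit in M EUR).
companies = [
    ("Company A", 1, 0.1),
    ("Company B", 2, -0.2),
    ("Company C", 3, 0.3),
    ("Company D", 4, 0.0),
    ("Company E", 5, 0.5),
    ("Company F", 6, -1.0),
    ("Company G", 7, 0.2),
    ("Company H", 8, 0.9),
    ("Company I", 9, 1.5),
    ("Company J", 10, 0.05),
]

# quantiles(..., n=5) returns the four quintile cut points; the last one is
# the 80th percentile, i.e. the threshold for the top 20%.
turnover_cutoff = quantiles([c[1] for c in companies], n=5)[-1]
profit_cutoff = quantiles([c[2] for c in companies], n=5)[-1]

top_companies = [
    name
    for name, turnover, profit in companies
    if turnover >= turnover_cutoff and profit >= profit_cutoff
]
print(top_companies)
```

Because both conditions must hold, the resulting group is usually much smaller than 20% of the sample: high-revenue companies with thin margins and profitable but small companies both drop out.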

Several companies, however, achieve success through efficiency rather than volume. Redhill Games Oy, Sensofusion Oy, and Youpret Oy demonstrate exceptionally strong profit margins despite more modest revenue bases, pointing to lean operations, high-value offerings, or both. These companies prove that profitability doesn’t require massive scale — just strategic clarity and execution.

Others, like Kamrock Oy, Sorter Oy, and Vertaa Ensin Suomi Oy, fall into a different category: high revenue, low profitability, likely driven by thin margins or heavy cost structures. This pattern underscores that growth without operational efficiency can be a liability, not an asset.

A third cluster — ResQ Club Oy, Smartum Oy, and Gobybike Finland Oy — strikes a strong balance between revenue and profitability. Their success often ties into sustainability, employee benefits, or social responsibility, aligning with evolving consumer and cultural values in Finland.

Startups: Leadership Network

Finally, let’s look at how decisions are made in Finnish startups.

The graph above visualizes the network of decision-makers across Finnish startups, where each node represents an individual and each edge indicates a shared leadership role within the same company (i.e., two individuals are connected if both are listed as decision-makers in a company).

The ecosystem’s leadership network is characterized by fragmentation with pockets of connectivity. Most decision-makers operate in isolation or in pairs — over 150 groups consist of exactly two individuals, typically reflecting co-founders or single-company boards.

While some larger, highly interconnected clusters emerge, they often reflect companies with larger internal leadership boards.

However, a quick topological analysis also reveals a small number of dense clusters in which individuals hold roles across multiple companies. These high-degree nodes, visually prominent in the network graph, represent ecosystem connectors such as serial entrepreneurs, investors, or board professionals, who act as bridges between otherwise siloed startups. While rare, these interconnected leadership hubs play a disproportionate role in spreading knowledge, aligning strategic thinking, and reinforcing institutional memory across ventures. The structure is thus heavy-tailed: the vast majority of governance ties are isolated, but a select few nodes wield broad influence, shaping the trajectory of multiple companies in parallel.
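The network construction and the two observations above (isolated groups versus cross-company connectors) can be sketched with the standard library alone. The company rosters and names below are hypothetical, and the project itself may well use a dedicated graph library; this is only meant to show the logic.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of companies to their listed decision-makers.
boards = {
    "Company A": ["Aino", "Mikko"],
    "Company B": ["Mikko", "Sara", "Ville"],
    "Company C": ["Liisa", "Olli"],
    "Company D": ["Pekka"],
}


def build_graph(boards):
    """Connect two people whenever they are decision-makers in the same company."""
    graph = {}
    for people in boards.values():
        for person in people:
            graph.setdefault(person, set())  # keep solo decision-makers as nodes
        for a, b in combinations(people, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Find groups of people linked directly or indirectly (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


graph = build_graph(boards)
components = connected_components(graph)
component_sizes = sorted(len(c) for c in components)

# Connectors: people listed in more than one company.
roles = Counter(person for people in boards.values() for person in people)
connectors = [person for person, count in roles.items() if count > 1]

print(component_sizes, connectors)
```

In this toy example, Mikko is the only connector: without him, Company A and Company B would form two separate groups instead of one component of four people.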

Conclusion

The Finnish startup ecosystem presents a compelling case of focused specialization and distributed experimentation. Anchored by the Helsinki Metropolitan Region — with dense clusters around Maria 01, Aalto University, and the city centre — Finland’s innovation landscape is geographically centralized but organizationally diverse. Finland’s most successful startups are not monolithic. Some win on scale, others on efficiency or values — but all represent viable archetypes within an ecosystem increasingly shaped by economic pressure, cultural relevance, and operational discipline.

Sectorally, the ecosystem is dominated by IT and software, both in company count and employment, reflecting global demand for digital solutions and the nation’s technical talent base.

Financially, the ecosystem reveals a scaling revenue base but persistent challenges in profitability. Most startups operate with modest turnover and negative margins. The spread of financial outcomes — from micro-startups to multi-million-euro scale-ups — underscores the broad gradient of maturity within the ecosystem. Liquidity remains strong, but solvency remains a weak point, reflecting dependency on external capital and the early-stage nature of many firms.

On the human side, the leadership network shows a fragmented structure with a few key connectors. While most decision-makers are siloed within individual companies, a small number bridge startups, spreading knowledge and influence across the system.

Taken together, Finland’s startup landscape is lean, digitally focused, and increasingly sophisticated. Its strengths lie in technical depth, talent density, and a growing base of operationally sound ventures. To evolve into a more self-sustaining innovation economy, the next frontier will be deepening sectoral diversification, strengthening financial independence, and fostering more connective leadership across ventures.

Sources

The code & the data:

GitHub - Geometrein/startups: This project explores the Finnish startup ecosystem.

The interactive visualizations:

marimo

Other sources:

Browse startups - Startup100

Dealroom

How to Rick Roll Like a Data Scientist?

Tigran Khachatryan — Tue, 20 Aug 2024 07:01:44 GMT

A silly use of solid body trajectoids

A trajectoid. Image by the Author

Rick Rolling is a legendary internet phenomenon that has transcended mere meme status and made its way into various cultural domains, including media, technology, and even politics. It began as a simple bait-and-switch prank but has evolved into a cultural symbol that represents the playful and ironic nature of the internet.

Naturally, it’s time to introduce its data-driven variant targeted at Data Scientists by leveraging a fascinating concept from computational geometry: solid body trajectoids.

In this article, we will explore trajectoids, demonstrate their application using S&P 500 data, and create the ultimate data-driven Rick Roll that can literally be rolled.

What are Trajectoids?

Solid-body trajectoids are 3D shapes designed to follow a predetermined path when rolled. For instance, if you draw a straight line on a ramp and roll a sphere down it, the sphere will naturally roll along that straight line. What if the path is more complex? Is there a shape that, when rolled, will follow this more intricate path? Moreover, does every arbitrary path have a corresponding 3D shape that will follow it precisely when rolled?

A happy trajectoid in its natural habitat rolling down a slope. Image by Nature

This exact question was explored by a team of researchers led by Professor Bartosz Grzybowski at the Institute for Basic Science in Korea, and their findings reveal that such shapes exist for nearly all paths*. The researchers dubbed these shapes trajectoids and described the mathematical procedure for generating them in the paper Solid-body Trajectoids Shaped to Roll Along Desired Pathways. After publication, the findings turned out to be interesting not only from a geometric point of view but also for their far-reaching implications in quantum computing and robotics. Let’s get this ball rolling with some concrete examples to see what trajectoids actually look like!

S&P 500 Trajectoid

Let’s first explore how trajectoids work by using a familiar path as input. The S&P 500 is a free-float capitalization-weighted stock market index that tracks the performance of 500 large companies listed on stock exchanges in the United States. The index often serves (unjustifiably) as a barometer for the U.S. economy and often makes headlines for its sometimes erratic behaviour. We’ll take the time series of the S&P 500’s daily closing prices and generate a 3D shape that can roll along it.

S&P500 composite index over the past year. Image by the author.

Daily prices fluctuate quite a bit, which would make the movements of the resulting trajectoid quite jerky. We can smooth the line with a rolling average to make the rolling path a little smoother (you can see the full data transformation steps in this repository) and use the processed time series to generate its trajectoid:

S&P500 2024 Trajectoid. Image by the Author.
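The smoothing step described above, together with the differencing mentioned in the notes, could look roughly like the sketch below. This is a minimal illustration and not the code from the repository: the function names `rolling_mean` and `difference`, the window size, and the sample prices are all assumptions made for the example.

```python
def rolling_mean(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def difference(values):
    """First difference: the change between consecutive points."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical closing prices, invented purely for illustration.
closes = [4700.0, 4760.0, 4730.0, 4790.0, 4850.0, 4820.0]

smoothed = rolling_mean(closes, window=3)  # removes day-to-day jitter
steps = difference(smoothed)               # small values instead of thousands
print(smoothed)
print(steps)
```

A larger window gives a smoother path at the cost of flattening short-lived moves, so the window size is a trade-off between fidelity to the data and how well the shape rolls.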

While the shape of a trajectoid might seem arbitrary, it’s important to remember that the stock market, at its core, is arguably the largest information-processing system created by humanity. It operates as a vast network of agents (rational or irrational, depending on your stance on the efficient market hypothesis) constantly interacting with and reacting to an immense flow of new information. This trajectoid represents the collective responses of the U.S. stock market to information processed over the past year. The rise and fall of the S&P 500 guide the trajectoid’s rolling direction, embedding the market’s ongoing adjustments to the influx of data into its geometric shape.

Never Gonna Give You Up

Now that we’ve warmed up with some financial data, it’s time to turn Rick Astley’s “Never Gonna Give You Up” into a trajectoid. Let’s first see what the audio waveform for “Never Gonna Give You Up” looks like.

Rick Astley’s “Never Gonna Give You Up” waveform. Image by the author.

An audio waveform represents the variations in air pressure (sound) over time. It’s a direct visual representation of the song’s amplitude (volume) at each moment. Visualizing this waveform gives us an idea of the song’s structure — the peaks and valleys correspond to louder and softer parts of the track. However, while visually interesting, the raw waveform version of the audio is not ideal for creating our trajectoid. We need to simplify this data into a form that can be more easily manipulated.

To create a more usable path, we focus on the amplitude envelope of the waveform. The amplitude envelope captures the general shape of the waveform by tracking the peaks over time, giving us a smoother and more continuous path. This envelope more closely resembles a trajectory and by extracting it, we reduce the complexity of the waveform while preserving the overall shape and dynamic range of the audio.

Rick Astley’s “Never Gonna Give You Up” amplitude envelope. Image by the author.
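One simple way to extract such an envelope is to split the signal into short frames and keep the peak absolute amplitude of each frame. The sketch below assumes this frame-based approach; the function name `amplitude_envelope`, the frame size, and the sample values are hypothetical and not taken from the article's repository.

```python
def amplitude_envelope(samples, frame_size):
    """Peak absolute amplitude of each non-overlapping frame."""
    if frame_size < 1:
        raise ValueError("frame_size must be positive")
    return [
        max(abs(s) for s in samples[start : start + frame_size])
        for start in range(0, len(samples), frame_size)
    ]


# Hypothetical audio samples in the range [-1, 1], invented for illustration.
samples = [0.1, -0.4, 0.2, -0.1, 0.8, -0.6, 0.3, 0.05, -0.9, 0.2]

envelope = amplitude_envelope(samples, frame_size=4)
print(envelope)
```

Each frame collapses to a single value, so the envelope is both shorter and smoother than the raw waveform while still following its loud and quiet passages.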

Next, we zoom in on the most recognizable part of the song: the chorus, where Rick sings “Never Gonna Give You Up.”

Chorus of Rick Astley’s “Never Gonna Give You Up”: amplitude envelope. Image by the author.

Similar to how we processed the financial data, we apply smoothing transformations to the amplitude envelope of the chorus. Smoothing helps to eliminate any remaining sharp transitions, ensuring the resulting path is suitable for trajectoid generation.

Trajectoid Generated from “Never Gonna Give You Up” segment amplitude envelope. Image by the author.

Let’s also visualize the path on the surface of the trajectoid to demonstrate the rolling better:

Trajectoid and the Path Generated from the “Never Gonna Give You Up” segment amplitude envelope. Image by the author.

When rolled on a slope, this shape will trace a path that represents the amplitude envelope of Astley's iconic “Never Gonna Give You Up”, giving us the ultimate data-driven rickroll that actually rolls while rickrolling.

If you would like to 3D print the trajectoid, you can find the .stl file and the full code behind the article here:

GitHub - Geometrein/trajectoids: How to Rick Roll with Trajectoids

About the Author

I am a curious Data Scientist with a strong passion for finding and understanding patterns. My interests include Math, Computer Science, Architecture & Urbanism. You can connect with me on LinkedIn and Github.

References

Original Trajectoids Article: Yaroslav I. Sobolev, Ruoyu Dong, Tsvi Tlusty, Jean-Pierre Eckmann, Steve Granick, and Bartosz A. Grzybowski. “Solid-body trajectoids shaped to roll along desired pathways”. Nature, 620, 310–315 (2023). https://doi.org/10.1038/s41586-023-06306-y

Open Access PDF: https://www.nature.com/articles/s41586-023-06306-y.epdf

Notes:

*The time series are smoothed with a rolling average and differenced. This avoids working with large values in the range of thousands and prevents the path from being too jerky (a known problem for trajectoids).

How to Rick Roll Like a Data Scientist? was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

Visualising the Global Arms Trade Network

Tigran Khachatryan — Tue, 30 Jul 2024 13:02:19 GMT

The Deadly Silk Road: Analysing the Global Arms Trade as a Spatio-temporal Network

Chord Diagram of Global Arms Transfers. Image by the Author. © github/geometrein

Introduction

In the last article, we discussed how the CIA utilized network science for the war effort. In this article, we will look into the logistical Silk Roads that supply modern conflicts with arms worldwide. The dataset explored here includes more than 10,000 arms transfer deals made worldwide between 2000 and 2023. It contains valuable insights into the flow of arms across countries, the types of weapons traded, and the major players in the global arms market.

The chart below illustrates the number of trade deals made for each weapon category between 2000 and 2023 by all countries, organizations and military alliances worldwide.

Image by the Author. © github/geometrein

Aircraft lead with the highest number of deals, followed by missiles and armoured vehicles. The dominance of aircraft and missiles suggests an emphasis on air dominance and long-range strike potential, typical of modern warfare. However, this representation might be misleading, since an aircraft deal typically covers a few aircraft while a missile deal can cover several thousand missiles. Let's visualize the number of units transferred in each deal instead.

Image by the Author. © github/geometrein

As suspected, missiles dominate the arms transfers, while aircraft fall back to 5th place (mind the logarithmic scale). Since missiles are the main weapon of modern wars and the core part of our dataset, we can focus on missile transfers and eliminate the complexity introduced by the multimodality of the trade.
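The difference between ranking categories by number of deals and by number of units can be sketched with a few lines of Python. The records below are hypothetical and are not taken from the SIPRI data; they only illustrate how the two aggregations can produce different leaders.

```python
from collections import Counter

# Hypothetical deal records as (category, units); invented for illustration.
deals = [
    ("Aircraft", 4),
    ("Aircraft", 12),
    ("Aircraft", 2),
    ("Missiles", 1500),
    ("Missiles", 3000),
    ("Armoured vehicles", 60),
]

# Ranking 1: how many deals were signed per category.
deal_counts = Counter(category for category, _ in deals)

# Ranking 2: how many units changed hands per category.
unit_totals = Counter()
for category, units in deals:
    unit_totals[category] += units

print(deal_counts.most_common(1))
print(unit_totals.most_common(1))
```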

Missile Transfers

Which countries are the largest missile exporters?

Image by the Author. © github/geometrein

The United States is the leading missile exporter by a large margin, followed by Russia, Israel, and Sweden. The chart reflects the dominance of the USA in the global arms market due to its advanced missile technology and extensive military-industrial complex.

Which countries are the largest importers?

Image by the Author. © github/geometrein

Saudi Arabia leads as the top importer, followed by Ukraine, India, the UAE, and Poland. The quantities indicate a significant demand for missiles in these countries, likely reflecting regional security concerns, military modernization programs, or involvement in ongoing conflicts. This pattern of imports underscores the strategic priorities of these nations and their reliance on advanced missile technologies to bolster their defence capabilities. The high import levels in these regions highlight the geopolitical tensions and the emphasis on maintaining robust military arsenals.

Are there more missiles being transferred?

The chart below illustrates the volume of recorded Missile trades worldwide between 2000 and 2023.

Image by the Author. © github/geometrein

There is a noticeable increase in missile exports around every major conflict escalation. Over the past two decades, the overall number of exported missiles has steadily risen, indicating a growing trend in global arms transfers. This trend reached an unprecedented spike with the invasion of Ukraine, highlighting the significant impact of geopolitical tensions on the arms trade. The sharp rise in missile exports during such critical events underscores the strategic importance of missile technology in modern warfare and the increasing reliance on these weapons by nations involved in conflicts.

Which missiles are sold the most?

Missiles are products, and naturally, some are more popular than others. The chart below illustrates the top product exports from each country over the past 23 years. This data highlights the varying demand and preference for different missile types across nations, reflecting the strategic priorities and technological advancements of each country. The chart offers a glimpse into the specific missile technologies that have dominated international trade over the past two decades.

Sunburst chart of missile exporters. Image by the Author. © github/geometrein

Military Alliances

Arms transfers not only provide insight into trade dynamics but also serve as a proxy for understanding military alliances. Conflicting nations are less likely to engage in arms trade, opting instead for direct missile exchanges on the battlefield. Conversely, cooperating countries demonstrate their alliances by purchasing or supplying arms to each other. Based on this hypothesis, applying a community detection algorithm to our trade network should reveal military alliances established in the modern world.

As expected, there are two main camps. The United States, the UK, a large part of the EU and Israel form one community, while Russia and China form their own. An interesting finding is that France seems to have been trading arms with the same countries that Russia and China normally trade with. Another interesting community can be seen between North Korea, Hamas and the PRC, likely an attempt by Kim Jong Un to run a side hustle and raise some cash.
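The article does not say which community detection algorithm was used, so the sketch below swaps in one simple option, weighted label propagation, purely as an illustration. The function name `label_propagation`, the placeholder nodes A to F, and the edge weights are all hypothetical; in the real analysis the nodes would be countries and the weights trade volumes.

```python
from collections import defaultdict


def label_propagation(edges, max_rounds=20):
    """Weighted label propagation on an undirected graph.

    edges: iterable of (node_a, node_b, weight).
    Returns a dict mapping each node to a community label.
    """
    neighbours = defaultdict(dict)
    for a, b, weight in edges:
        neighbours[a][b] = neighbours[a].get(b, 0) + weight
        neighbours[b][a] = neighbours[b].get(a, 0) + weight

    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):  # fixed order keeps runs deterministic
            scores = defaultdict(float)
            for other, weight in neighbours[node].items():
                scores[labels[other]] += weight
            best = max(scores.values())
            candidates = sorted(l for l, s in scores.items() if s == best)
            # Keep the current label on ties, otherwise take the first candidate.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical trade network: two tightly trading blocs and one weak link.
trades = [
    ("A", "B", 5), ("A", "C", 5), ("B", "C", 5),
    ("D", "E", 5), ("D", "F", 5), ("E", "F", 5),
    ("C", "D", 1),
]

labels = label_propagation(trades)
communities = defaultdict(set)
for node, label in labels.items():
    communities[label].add(node)
groups = sorted(sorted(members) for members in communities.values())
print(groups)
```

Heavy trade inside a bloc outweighs the single weak link between blocs, which is why the two groups stay separate. Modularity-based methods are a common alternative and may give different partitions on real data.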

We can visualize the arms trade relation network geographically to gain a comprehensive understanding of how trade dynamics have evolved over time. By examining these visualizations, we can identify patterns and trends in arms transactions. The animation below provides a dynamic illustration of the global arms trade network, highlighting key changes and developments in the industry. This visualization starkly illustrates how geopolitical events and escalating regional conflicts influence the flow of arms.

Image by the Author. © github/geometrein

There are several noteworthy timestamps in the animation that merit attention. For instance, arms transfers to Ukraine skyrocketed following the Russian invasion. Ironically, the military operation that was intended to demilitarize Ukraine has led to an unprecedented concentration of arms in the country. Additionally, the animation illustrates how the COVID-19 pandemic impacted civilian logistics infrastructure and military supply chains, affecting the dynamics of the arms trade worldwide. The flow of arms also illustrates how the state of mutually assured destruction has simply channelled the ongoing competition for resources between the larger states into proxy wars fought by third parties. The conflicts shift geographically, but the main arms suppliers remain unchanged.

Code & Data

The underlying dataset has a lot more interesting relationships to explore that were not included in this article. The code behind the visualizations, cleaned and organized data and some extra visualizations can be found in this repo:

GitHub - Geometrein/networks

The cleaned data used for this article can be found here:

networks/projects/the_deadly_silk_road/data at main · Geometrein/networks

The final dataset was created based on SIPRI’s data. All the credit for meticulously recording and publishing the data goes to them.

About the Author

Tigran is a curious Data Scientist with a strong passion for finding and understanding patterns. His interests include Math, Computer Science, Architecture & Urbanism. You can connect with him on LinkedIn or Github.

How the CIA Used Network Science to Win Wars

Tigran Khachatryan — Tue, 16 Jul 2024 13:02:43 GMT

The Max-Flow Min-Cut Theorem

Image by the Author

The importance of logistics in modern warfare has only risen since the invention of total war during the Napoleonic era. During the First World War, US Army General John J. Pershing famously noted that:

“Infantry wins battles, logistics wins wars.”

US Army General John J. Pershing, commander of the American Expeditionary Forces on the Western Front during WWI.

This theory remained in the US Army's arsenal even during the Cold War. After the Second World War, rail networks were a critical component of Soviet logistics, used for moving military supplies, troops, and economic goods across the vast expanse of the country. Understanding and potentially disrupting the logistics and transportation networks of the USSR was of strategic interest to the United States. Therefore, in the 1950s, the U.S. Air Force commissioned a study to determine how to effectively cut off rail infrastructure within the Soviet Union and Eastern Europe in case of a potential conflict. The study was conducted by mathematician T.E. Harris and retired U.S. Air Force General F.S. Ross. In 1955, they published a classified report titled “Fundamentals of a Method for Evaluating Rail Net Capacities” that modelled the Soviet rail infrastructure as a flow network.

In this first-of-its-kind report, the authors modelled the Soviet rail network as a set of nodes and links. The nodes represent the aggregated rail infrastructure of a region, while the links represent the connections between these transportation hubs. Each link was assigned a weight based on the estimated tonnage that could be transferred from one node to the other within a day (see illustration below).

Fundamentals of a Method for Evaluating Rail Net Capacities T.E. Harris & F.S. Ross

The study's goal was to estimate the rail network's capacity between the nodes and determine which links were the most optimal airstrike targets to paralyse the flow of cargo.

While this might seem like a straightforward task, there is a catch! In well-connected networks, severing the link between two nodes doesn't necessarily stop the flow between them. Often the flow can simply be re-routed through a neighbouring node.

In the illustration above, if the link between nodes A & B is severed, the nodes will still be connected through node C, and some of the cargo can still reach its destination. When this principle is applied to large networks with a great number of interconnected triplets, finding the optimal cuts becomes a compelling challenge. We can illustrate this by recreating the original Harris-Ross network of the Soviet rail system with NetworkX and seeing what that means in practice.

Recreated Harris-Ross network. Image by the Author

The illustration above demonstrates the recreated Harris-Ross network. It is based on the network presented in the original paper and has the following properties:

Number of nodes: 42
Number of edges: 91
Average degree: 4.33
Density: 0.1057
Average clustering coefficient: 0.4364
Diameter: 8
Average shortest path length: 3.55
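The average degree and density above follow directly from the node and edge counts. The short check below assumes the network is treated as a simple undirected graph, which is consistent with the reported values.

```python
nodes, edges = 42, 91

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * edges / nodes

# Density: existing edges divided by the n * (n - 1) / 2 possible edges.
density = 2 * edges / (nodes * (nodes - 1))

print(round(average_degree, 2), round(density, 4))
```

A density of roughly 0.1 means that only about one in ten possible links exists, yet the clustering coefficient shows that neighbouring hubs are often connected to each other, which is what makes re-routing possible.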

Which links should be removed to sever the connection between two arbitrarily selected nodes? The answer is far from obvious. There is more than one path connecting a pair of nodes, hence the excess load created by a missing link can be distributed onto the remaining links. We can illustrate this with a more traditional approach: remove the most important links/nodes from the network and see how that affects the network's connectivity. We can simulate hypothetical air strikes on the network and check how the size of the giant component changes after each strike removes an important node or link.

Image by the Author.

The charts above illustrate how the network disintegrates as each hypothetical air strike disables a link/node. The speed of the network's disintegration is compared to the removal of random links/nodes, which serves as a benchmark for the strategy. As we can see, in the case of the Harris-Ross network, removing the most important nodes or edges has a comparable effect on the integrity of the graph to removing random nodes and edges. The network disintegrates at a similar pace in both cases. This illustrates that, for this network, targeting the largest railway depots or the links with the most capacity doesn’t necessarily result in better outcomes than playing a military whack-a-mole and removing random depots and links.
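A targeted-removal experiment of this kind can be sketched in plain Python as follows. The article does not state which importance measure was used, so node degree is assumed here; the function names and the toy graph are hypothetical and much smaller than the Harris-Ross network.

```python
from collections import deque


def giant_component_size(adjacency):
    """Size of the largest connected component (breadth-first search)."""
    seen, largest = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in adjacency[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        largest = max(largest, size)
    return largest


def remove_node(adjacency, node):
    """Return a copy of the graph without `node` and its links."""
    return {n: nbrs - {node} for n, nbrs in adjacency.items() if n != node}


def targeted_attack(adjacency, strikes):
    """Remove the highest-degree node each round; record giant component size."""
    sizes = [giant_component_size(adjacency)]
    for _ in range(strikes):
        target = max(sorted(adjacency), key=lambda n: len(adjacency[n]))
        adjacency = remove_node(adjacency, target)
        sizes.append(giant_component_size(adjacency))
    return sizes


# Hypothetical toy network: one hub "H" plus a small triangle H-A-B.
graph = {
    "H": {"A", "B", "C", "D"},
    "A": {"H", "B"},
    "B": {"H", "A"},
    "C": {"H"},
    "D": {"H"},
}

print(targeted_attack(graph, strikes=2))
```

In a hub-dominated toy graph like this one, the targeted strategy is devastating. The point made above is that the Harris-Ross network is not built this way, so the same strategy performs no better than random removal.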

Max-Flow Min-Cut theorem

To address this limitation, Harris and Ross in cooperation with Dantzig and Fulkerson developed a theoretical and practical method that today is known as the Max-Flow Min-Cut theorem. The theorem states that in a flow network, the maximum amount of flow that can be sent from the source node to the sink node is equal to the total capacity of the edges in the minimum cut, that is, the set of edges with the smallest total capacity whose removal separates the source from the sink.

An easy way to conceptualize the idea behind the theorem is to imagine a system of pipes carrying water from a reservoir (source node) to a tank (sink node). Each pipe has a certain capacity for water flow. The Max-Flow Min-Cut theorem states that the maximum water flow you can achieve (max flow) is exactly equal to the total capacity of the smallest-capacity set of pipes you would need to block to stop all water from getting to the tank (min cut).
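The pipe analogy can be made concrete with a small implementation. The sketch below uses the Edmonds-Karp variant of the augmenting-path method, chosen here for brevity and not necessarily what the article's repository uses. The function name, the node names and the capacities are hypothetical, and the edges are directed; an undirected rail link would be modelled by adding an edge in each direction.

```python
from collections import defaultdict, deque


def max_flow_min_cut(capacities, source, sink):
    """Edmonds-Karp max flow; returns (flow value, list of min-cut edges).

    capacities: dict mapping a directed edge (u, v) to its capacity.
    """
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), cap in capacities.items():
        residual[u][v] += cap
        residual[v][u] += 0  # make sure the reverse edge exists

    def bfs_parents():
        """Breadth-first search over edges that still have spare capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break
        # Walk back from the sink to recover the augmenting path.
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

    # Nodes still reachable from the source form one side of the minimum cut.
    reachable = set(parents)
    cut = sorted(
        (u, v) for (u, v) in capacities if u in reachable and v not in reachable
    )
    return flow, cut


# Hypothetical pipe capacities between a reservoir "S" and a tank "T".
pipes = {
    ("S", "A"): 10, ("S", "B"): 10,
    ("A", "B"): 2, ("A", "T"): 4, ("B", "T"): 9,
}

flow, cut = max_flow_min_cut(pipes, "S", "T")
print(flow, cut, sum(pipes[edge] for edge in cut))
```

The last value printed is the total capacity of the cut edges, which equals the maximum flow, exactly as the theorem states.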

We can apply Max-Flow & Min-Cut algorithms to analyse the network capacity between two nodes and determine the optimal links to strike for a given pair of nodes. The graphs below illustrate the optimal airstrike targets for the Kyiv-Berlin and Moscow-Donetsk source-sink pairs (labelled target and sink in the figures):

Minimal cut for Target: Kyiv Sink: Berlin. Image by the Author

Minimal cut for Target: Moscow Sink: Donetsk. Image by the Author

The charts highlight the links that serve as the minimal cut between the selected target and sink nodes. These would serve as the optimal airstrike targets to incapacitate the cargo transfer capacity between the two selected nodes. We can iterate over all nodes in the network and visualize the min cuts for each of the node pairs. The animation below illustrates how minimal cuts would look when the target node is Berlin and the sink node is randomly selected from the network.

Image by the Author

In these visualizations, the Max Flow Min Cut algorithm demonstrates its effectiveness in identifying the critical points of failure within a network. By targeting these minimal-cut links, the cargo transfer capacity between Berlin and any other node can be fully disrupted, highlighting potential vulnerabilities. The algorithm is particularly powerful because it provides a clear, quantifiable measure of the network’s resilience and the potential impact of targeted disruptions. What the Harris-Ross method does differently is that it takes into account the capacity of the rest of the network to compensate for the lost links and this allows for more accurate modelling. This consideration of network redundancy and the ability to reroute flows makes the Harris-Ross method more robust for practical applications where network elements may be designed to adapt to losses.

Conclusion

While the US Air Force never really used the method against the USSR, the algorithm saw many improvements over time. Its modern and more complex variants are used today to manage traffic flows and prevent bottlenecks. They are crucial for identifying points that could lead to cascading failures in power grids. Businesses leverage these methods for decision-making and for understanding the critical points in their supply chains and operational networks. Social media networks use these methods to analyze and optimize the flow of information.

Moreover, in the realm of computer vision, improved versions of these algorithms are employed in image segmentation tasks, allowing for efficient and effective image processing.

Overall, the evolution of the Max-Flow Min-Cut theorem and its derivatives highlights the importance of robust network analysis techniques in an increasingly connected world. These methods not only provide insights into potential vulnerabilities but also offer strategies for enhancing the resilience and efficiency of complex systems across various domains.

The code

You can find all the code used for the article in this repo:

networks/projects/max_flow_min_cut_theorem at main · Geometrein/networks

Data Sources

All the data used for the article can be found here:

networks/projects/max_flow_min_cut_theorem/data at main · Geometrein/networks

References

T.E. Harris, F.S. Ross: Fundamentals of a Method for Evaluating Rail Net Capacities (1955)
G. Dantzig, D.R. Fulkerson: On the max-flow min-cut theorem of networks

How the CIA Used Network Science to Win Wars was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Finnish Line

Tigran Khachatryan — Mon, 17 Jul 2023 07:02:20 GMT

Statistical critique of Finland’s 2023 government plans and immigration policies.

On June 16, 2023, Finland’s new government announced its government program titled “A Strong and Committed Finland”. The plan met with a mixed reception and faced criticism on multiple fronts.

One of the most controversial changes was the 180-degree turn on immigration and labour policies. Some of the changes included:

8 years of residence period for obtaining citizenship.
6 years of residence period for obtaining permanent residence.
Cancellation of work-based residence permits after 3 months of unemployment.

Undoubtedly, these rules foster the creation of parallel societies in Finland where the immigrant population pays equal taxes but gets fewer rights for an extended amount of time. However, since the new Minister of the Interior, Mari Rantanen, claims to believe in numbers and statistics, we will use exactly those numbers and statistics to see what the potential economic impact of these new policies can be and whether the arguments made by the governing parties have any factual grounds.

There are many aspects of the government plan worth discussing but this article will mostly focus on work-based immigration and its socio-economic effects. The article will illustrate that in the end, it is the Finnish citizens who will have to pay the hefty price for the anti-immigrant policies planned by the new government.

Finland’s Demographic Structure

Finland’s population is ageing. Over the past few decades, life expectancy has risen while the birth rate has decreased consistently. If the current low birth rates and increasing life expectancy continue at the same rate, the working-age population is projected to decrease by nearly 20% in the next 40 years. This trend can be clearly observed in Finland’s population pyramid presented below.

Image by the author.

The shrinking bottom and the growing top of Finland’s population pyramid for the past 20 years have been gradually inverting it. The inverted population pyramids pose a significant threat to the national economy since they can seriously destabilise the financial equation holding the welfare state intact. With a smaller working-age population, tax income will decrease, while the costs of healthcare services will increase due to longer life expectancy. Moreover, any substantial increase in the birth rate would be slow to impact the workforce and would incur additional spending on education and upbringing of younger individuals to working age.

Image by the Author

Over the past 20 years, Finland has formed 11 cabinets with vastly different programs; however, there is no visible deceleration in the decay of the population structure. The graph above illustrates how the age dependency ratio has grown over the past 20 years with each governmental term. By the end of 2022, Finland’s age dependency ratio was 62.5%. The only two developed countries that have a higher age dependency ratio are Japan and Israel.
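For readers unfamiliar with the metric, the age dependency ratio is commonly defined as the number of dependants (people younger than 15 or older than 64) per 100 people of working age (15 to 64). The sketch below assumes this common definition, since the article does not spell out its age bands. The head counts are hypothetical and were chosen only so that the result matches the 62.5% quoted above.

```python
def age_dependency_ratio(young, old, working_age):
    """Dependants (0-14 and 65+) per 100 people of working age (15-64)."""
    return 100 * (young + old) / working_age


# Hypothetical head counts in thousands; not actual Finnish statistics.
ratio = age_dependency_ratio(young=150, old=230, working_age=608)
print(ratio)
```

A ratio of 62.5 means that every 100 people of working age support roughly 62 dependants, and the ratio rises whenever the top of the pyramid grows faster than its middle.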

The longer this upward trend is sustained, the harder it will become for any government to address the issue. The decreasing tax income will impact the ability of any government to enact policies that can effectively tackle the projected crisis. This will inevitably lead to more borrowing, with a smaller working population available to pay off the rising government debt.

Decreasing the unemployment rate, as suggested by the new government, might seem like a sensible, intuitive solution at first; however, even achieving an unrealistic 0% unemployment rate is not going to solve the problem from a demographic perspective. The age profile of the unemployed population is already included in Finland’s demographic pyramid. Hence their employment status has little effect on the situation at hand. Sooner or later the working population will feel the rising burden of supporting the dependent population. In this case, the government’s options are rather limited. The government can attempt to increase the birth rate, shorten life expectancy or increase immigration. The first, if implemented effectively, will have an effect in 20 years and has to remain effective for at least 20 years to have a meaningful long-term impact. The second is criminally malicious towards the elderly population and can be achieved by limiting access to public healthcare (something that the current government already intends to do). The third, if done correctly, can have a quick and potentially meaningful long-term effect on the economy.

Immigration

Some of the governmental parties and a sizable body of research suggest immigration as a possible solution to the issue of the ageing population. However, for immigration to be effective, Finland needs to step up its game in competing for a limited pool of immigrant workers. The new changes introduced in the government plan make other European states more attractive to an average immigrant, who has a delicate equation to balance between salary and tax levels, cost of living, and citizenship timeline. While salaries in Finland are relatively high, the cost of living has been rising steadily, which has a detrimental effect on Finland’s competitive edge. Longer permanent residency and citizenship timelines diminish the advantages altogether.

Between 2015 and 2022, the level of work-based immigration rose, yet it remained rather limited. As depicted in the graph below, only 68,500 new work-based residence permits were issued over these 8 years (approx. 8,560 per year on average).

Out of these 68,500, only 10,000 applied for a specialist visa (approx. 1,250 per year on average).

It is also visible that work-based immigration is second in the total number of applications. Family-based residence permits, which are largely granted to families of work-based immigrants, make up the largest portion. Looking at the same group of people from the perspective of age composition shows that they indeed belong to the groups that Finland is in dire need of. Young families arriving in Finland not only help with short-term labour issues but also give a long-term boost to birth rates.

Additionally, the vast majority of immigrants have a European cultural background. Nearly half of all immigrants are either from Russia or Ukraine. This narrows the cultural gap and shortens the potential cultural integration time. Hence, the xenophobic sentiments the new government often uses to fearmonger are not supported by the data from the very same government’s migration office.

Looking at the official statistics from Migri in juxtaposition with the population pyramid, it becomes clear that the amount of new “imported” workforce is critically low if Finland intends to support the ageing population. The immigration patterns observed above illustrate that past governments have so far adopted a sort of homoeopathic approach to immigration: watered-down policies that mostly failed to position Finland as a working hub in the international labour market. However, over the past years, the problem has become too serious to experiment with home remedies. It has become clear that the decay of the demographic structure is not an issue that can be solved by internal means, and external “medicine” is necessary to prevent the looming crisis.

Immigration strengthens the solvency of the social security system by immediately boosting the number of new workers. From a purely financial point of view, immigrants are a ready workforce that Finland’s taxpayers did not have to spend 20 years raising, educating and caring for. Additionally, workers who bring their families and kids provide additional support to the decaying demographic pyramid with close to zero upfront investment from the state. While it is true that spouses of newly immigrated workers might become recipients of social benefits, the social benefits paid to these population groups for several years are far less than what Finland would have to pay to raise the same number of people to working age in Finland. Furthermore, the social benefits paid out to people who reside in Finland are not a net negative from an economic perspective. Social benefits are economic stimulus checks distributed to the weakest groups of the population. Immigrants spend their social security support inside the country, and that same money becomes locals’ income (income that is also taxed by the government).

Immigration has all the right components for addressing the looming demographic crisis. However, the components alone are not enough to fashion a functioning machine. These components should be systematically put together into a larger framework by the Government through increasing immigration volumes, speeding up integration timelines and motivating international workers to arrive and stay in Finland.

“Yes, but…”

Nevertheless, the proponents of anti-immigration policies materialized in the form of the new government often take a stance that the potential downsides of immigration are far worse than increasing the burden on the working population. The most commonly voiced arguments include:

Claims that increased immigration will raise crime rates.
Threats that immigrants will take the jobs of the locals.

Luckily, in Finland, there is open data available for validating/refuting these claims.

“Immigrants Increase Crime Rates”

The claim that increasing immigration volume leads to an increase in crime rates is one of the most commonly heard statements among those who are opposed to pro-immigration policies. Luckily this hypothesis can be easily tested.

First, let’s have a look at the total number of crimes committed in Finland between 2015 and 2022 juxtaposed with the total number of new residence permits granted.

If letting more immigrants into the country increases crime rates, then these two time series should be correlated (when one goes up, the other should follow). Visually, we can already observe that there is very little correlation between the two. Nevertheless, we need to take into account that there can be lagged correlations between them, meaning that letting immigrants in may affect crime rates not immediately but after some time. To test this assumption, we can perform a simple cross-correlation between the two time series. This involves shifting one time series along the other and checking for correlation at every step. The results can be seen below.

As shown in the chart above, there are only extremely weak negative and positive correlations, none of which is statistically significant. (If they were statistically significant, the main blue line would cross one of the red lines: crossing the top red line would indicate a positive correlation, while crossing the bottom red line would indicate a negative correlation.) Hence, we can refute the statement that increasing immigration leads to increased crime rates, since this has not been the case in Finland over the past 8 years.*
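The lagged comparison described above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the repository: the helper names `pearson` and `cross_correlation` and the toy series are assumptions made for the example, and the ±1.96/√n band is a common rule-of-thumb approximation that may differ from the bounds drawn in the chart.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cross_correlation(x, y, max_lag):
    """Correlate x[t] with y[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        # Drop the last `lag` values of x and the first `lag` values of y,
        # so that each x value is paired with the y value `lag` steps later.
        result[lag] = pearson(x[: len(x) - lag], y[lag:])
    return result

# Toy data (made up): y repeats x two steps later.
x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
y = [0, 0] + x[:-2]

ccf = cross_correlation(x, y, max_lag=3)
best_lag = max(ccf, key=ccf.get)       # lag with the strongest positive correlation
bound = 1.96 / math.sqrt(len(x))       # approximate 95% significance band
```

With only eight yearly observations, as in the data discussed here, such a band is wide, so only strong correlations would cross it.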

Nevertheless, to ensure that immigrants are not blamed for the crime rates we can further break down the number of committed crimes based on the nationality of the suspect who committed the offence. The graph below illustrates the percentage of crimes committed in Finland by nationality between 2000 and 2022.

As we can see, as of 2022, about 88.3% of all crimes are committed by Finnish nationals, about 7.2% are committed by other European nationals and only about 4.3% of all crimes are committed by Non-EU nationals. If we translate these values to their respective populations, then the crime rate among Finnish citizens would sit at around 5.69% while the crime rate among Non-EU nationals would sit at 3.02%. Hence, Finnish nationals are 1.88 times as likely to commit a crime as Non-EU nationals. This does not come as a surprise to any immigrant, since all immigrants are well aware that they might lose their residence permits in case of any criminal involvement. Already at the application stage, every residence permit application asks the applicant to disclose all criminal offences, so even those who have never committed a crime know that this information is part of the consideration when granting a residence permit. This awareness can be one of the reasons why Non-EU nationals are more careful with the law than Finnish nationals, who will not face deportation regardless of the severity of the committed crimes.
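The ratio quoted above is simple arithmetic on the two per-group rates. The sketch below reproduces it; the helper `crime_rate` is hypothetical and assumes that a group's rate is its number of suspected offences divided by its population, which is one plausible reading of the calculation rather than a confirmed description of it.

```python
def crime_rate(offences, population):
    """Offences per 100 members of a population group (assumed definition)."""
    return 100 * offences / population

# Rates quoted in the text, in percent of each group.
rate_finnish = 5.69
rate_non_eu = 3.02

ratio = rate_finnish / rate_non_eu
print(f"Finnish nationals: {ratio:.2f}x the Non-EU rate")  # prints 1.88x
```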

To conclude, the number of granted residence permits is not correlated with crime rates, and the crime rate among Non-EU citizens is almost half that among locals.

“Immigrants Will Take Our Jobs”

This argument commonly stems from an oversimplification of economic relationships. The number of jobs in the Finnish economy is not fixed. A hired worker increases the productivity of the company, which leads to growth and allows more workplaces to appear. It is true that in some segments migrants may compete with existing workers in Finland; however, this competition will not take place across all skill levels and industries. It will make everyone involved in the labour market more productive by encouraging professional skill-building. In the end, this competition will increase the productivity of the overall market, which is again a net positive for everyone living in Finland. Additionally, if a company can hire qualified employees faster, those employees can start contributing sooner, leading again to longer periods of increased productivity.

If hiring is limited to local professionals, then whenever the necessary specialists cannot be found locally the company will need to invest in training before it sees any productivity increase. These sorts of delays are detrimental to the growth of the economy at best and can be fatal for certain companies at worst.

Additionally, according to Statistics Finland data, entrepreneurship levels among immigrants are higher than among Finns. The entrepreneurship levels among migrants have been gradually rising since the year 2000.

This can potentially mean that an immigrant is slightly more likely to start a business and create new workplaces than a local. These figures are even more outstanding when you consider that it is much harder for a foreigner to find a job or to start a business in Finland because of the language barrier and lack of connections that play a big role in the local market.

Let us also not forget that many of the traditional brands that Finland is proud of today were established by immigrants, including Arabia, Finlayson, Sinebrychoff, and Paulig.

To conclude, the number of jobs in the economy is not fixed, and immigrants have a higher rate of entrepreneurship and potentially a higher likelihood of creating new workplaces.

Final Thoughts

No immigrants have moved to Finland because they enjoy the long dark winters, lovely Novembers or cherish being bitten by a giant mosquito. They immigrated despite all the difficulties, motivated by the idea of Finland as an open country. If this idea and the common vision that attracted people cease to exist, many other countries will instantly become more attractive. If the Finland that people chose to migrate to no longer exists, immigrants who are not welcome will look for a new home (again).

The immigration course outlined in the new government plan is not the Finland that people fell in love with. It is discriminatory. It fosters the creation of parallel societies where immigrants are considered second-class citizens who pay equal taxes yet receive fewer rights for an extended amount of time. Immigrants will be less protected by social security and the legal system. A foreigner who has lived in Finland for 8 years can now be deported after being out of work for 3 months. This policy will eliminate all reasons for foreigners to contribute to social security and unemployment funds because they are no longer protected by them.

What makes the situation even worse for immigrants is that they had no say in any of these choices. (Immigrants in Finland don’t have full voting rights.) The past months for many immigrants passed in shock and disbelief while they quietly watched how the Finnish neighbours, friends and colleagues they lived with for many years decided their future for them. This situation will inevitably prompt many immigrants to take control of their future by either protesting the changes or leaving Finland.

The scenario of working migrants leaving the country is perhaps the worst case for Finnish citizens. From a purely financial point of view, a working immigrant who leaves Finland is a monthly rent unpaid, a salary not spent, savings not invested, a product not bought and a salary not received by the locals. Additionally, if a worker leaves Finland they instantly become a liability for Finland because the state has to return their pension contributions. This money will leave Finland and be spent elsewhere, making Finland pay for boosting another country’s economy.

It is in Finnish citizens’ utmost interest that immigration volumes increase and the immigrants that are already in Finland remain in Finland. The economy is a complex adaptive system and as with any other complex system, it is impossible to remove parts of this system and guarantee that it will not result in cascading failures.

I have no doubt that these policies will end up being suicidal for the new government. The discrimination against immigrants has made immigrants ever more united and ever more involved in politics. Educated migrants will protest, support opposition parties financially and make their voices heard. It is only a matter of time before the populism that disregards statistics and taps into xenophobic sentiments exhausts itself.

At the end of the day, Finnish citizens are the ones who will have to live in the Finland of tomorrow, which is being shaped today. It is Finnish citizens who will bear the responsibility for the reckless policies proposed. Hence, I strongly encourage my network to stay vocal on the subject, not for the sake of immigrants but for your own. Any immigrant will tell you that it’s always better to try to fix things before it’s too late instead of slowly losing your home and suddenly having no other choice but to migrate.

Data & Code :

All data extracted from the StatFin database and the code used for the analysis are included in the repository for reproducibility.

GitHub - Geometrein/the-finnish-line: This repo contains an analysis used in an article about Finland's new immigration policies.

Feel free to contribute, suggest edits and extend the analysis.

Notes:

The views presented in this article belong solely to the author and do not necessarily reflect the opinions of any affiliated individuals or organizations. Readers are encouraged to conduct their own context-specific research before making any decisions based on the content of this article.

*Note that this study does not include refugees that were arriving in Finland during the 2014 war in Ukraine, the 2015 refugee crisis and the 2022 war in Ukraine. If we also include the refugees then the argument against immigrants increasing the crime rates will be even stronger.


Wolt Delivery Network Analysis

Tigran Khachatryan — Mon, 14 Feb 2022 07:59:47 GMT

Food Delivery Distances in Helsinki During August 2020.

The coronavirus pandemic had a substantial influence on urban mobility throughout the world. Telecommuting affected fundamental transportation parameters such as congestion, passenger/freight ratios, and greenhouse gas emissions, while last-mile delivery services became a crucial part of urban life. Although most of the restrictive measures have already been lifted, the habits and patterns developed during the pandemic have left their mark on consumer behaviour. In this article, we will analyse the Helsinki food delivery network during the summer of 2020.*

Helsinki has a diverse delivery ecosystem, but when it comes to food delivery there are two major players: Wolt and Foodora. A comprehensive analysis of the food delivery system would require data from both delivery services. However, when dealing with private companies, data availability becomes a serious bottleneck. Therefore, only the limited data available from Wolt was used for this article.

Ordering Patterns

Before looking at the delivery network, let's first take a brief look at ordering patterns. Daily and weekly cycles influence all human activities, and this is even more the case with food delivery services. Over the course of their lives, humans develop persistent habits that sync with daily cycles. As we can see from the graph below, there are very distinct hourly ordering patterns for food deliveries.

The Number of orders by day of the week and hour of the day. Image by the author.

Each day seems to have two peaks in the number of orders.
Hottest ordering times are slightly different for workdays and weekends.
During workdays, the number of orders peaks at 8:00 and 16:00, with a decrease in orders during lunchtime.
The weekends exhibit a similar behaviour but with a higher overall number of orders and with different peaks at 10:00–11:00 and 15:00–16:00.
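A table like the one behind this heat-map can be built by counting orders per weekday and hour. The sketch below uses made-up timestamps and a hypothetical helper `orders_by_weekday_hour`; it illustrates the idea and is not the code used for the figure.

```python
from collections import Counter
from datetime import datetime

def orders_by_weekday_hour(timestamps):
    """Count orders per (weekday, hour) cell; Monday is weekday 0."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)

# Made-up order timestamps, for illustration only.
orders = [
    datetime(2020, 8, 3, 8, 15),   # Monday morning
    datetime(2020, 8, 3, 8, 40),   # Monday morning
    datetime(2020, 8, 3, 16, 5),   # Monday afternoon
    datetime(2020, 8, 8, 10, 30),  # Saturday morning
]
counts = orders_by_weekday_hour(orders)
```

Each key of `counts` is one cell of the heat-map, and cells without orders simply read as zero.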

If you're a food delivery company, this is great news because you can predict demand based on deterministic trends and scale your delivery infrastructure up or down to match the hourly demand. Larger overall trends can affect the total number of orders, but the hourly ordering patterns should remain relatively unaltered.

If you’re interested in an in-depth exploration of ordering patterns, you can find a more detailed analysis and time series forecast here:

Time Series Forecasting with SARIMAX

Wolt’s current business model uses a distance-based progressive pricing model for the delivery fees. Currently, the distance-based delivery fee rates are 1.90 € up to 1 km, 3.90 € up to 2 km, 5.90 € up to 3 km, and 7.90 € for deliveries over 3 km.
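The fee tiers above translate directly into a small lookup function. This is a sketch only: the function name `delivery_fee` is made up, and treating each "up to" boundary as inclusive is an assumption.

```python
def delivery_fee(distance_km):
    """Delivery fee in euros for a given distance, using the tiers in the text."""
    if distance_km <= 1:
        return 1.90
    if distance_km <= 2:
        return 3.90
    if distance_km <= 3:
        return 5.90
    return 7.90  # anything over 3 km

fees = [delivery_fee(d) for d in (0.4, 1.5, 2.9, 6.0)]
```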

Wolt hourly courier deliveries between 01.08.2020 and 07.08.2020. Image by the author.

As we can see from the animation above, this pricing model affects ordering behaviour: the majority of deliveries are short-distance deliveries, with occasional longer ones.

Wolt Delivery Network

Now that we have some understanding of the ordering patterns, we can take a look at the delivery network. In order to generate our network, we will regard restaurants as origin nodes and users as destination nodes. If a restaurant has delivered to a user, the two nodes are linked with a directed edge. The graph below illustrates the generated delivery network. Origin (restaurant) nodes are marked in orange and destination (user) nodes in purple.

We can see that the central Helsinki area has a great concentration of restaurants while the users are relatively evenly distributed throughout the map. The Kallio, Pasila and Töölö neighbourhoods form smaller restaurant hubs.

The resulting network has the following basic properties:

DiGraph with 2098 nodes and 15901 edges
Average Degree: 15.158
Edge Density: 0.0036
Average Clustering Coefficient: 0.0941
Transitivity: 0.0110
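The average degree and edge density listed above follow directly from the node and edge counts. The sketch below shows the two formulas for a directed graph on a made-up set of deliveries and then re-derives the reported figures; the helper `basic_stats` is hypothetical and written in plain Python rather than with a graph library.

```python
def basic_stats(edges):
    """Node count, edge count, average degree and density of a directed graph."""
    unique_edges = set(edges)
    nodes = {node for edge in unique_edges for node in edge}
    n, m = len(nodes), len(unique_edges)
    average_degree = 2 * m / n     # in-degree plus out-degree, averaged over nodes
    density = m / (n * (n - 1))    # share of all possible directed edges
    return n, m, average_degree, density

# Made-up deliveries as (restaurant, user) pairs.
deliveries = [("r1", "u1"), ("r1", "u2"), ("r2", "u2"), ("r2", "u3")]
n, m, avg_degree, density = basic_stats(deliveries)

# The same formulas applied to the counts reported above.
reported_avg_degree = 2 * 15901 / 2098     # about 15.158
reported_density = 15901 / (2098 * 2097)   # about 0.0036
```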

Our network has a unique structure. The low edge density and average clustering coefficient don’t come as a surprise, since Wolt makes deliveries only from venues to users, and users are not connected to each other. This results in our network having two distinct kinds of nodes: the origin (restaurant) nodes only have out-degree, while the destination (user) nodes only have in-degree. A high in-degree means that the user is ordering from a diverse set of restaurants, while a high out-degree indicates a popular restaurant that delivers to many unique users. Below we can see the top 5 nodes by in-degree and out-degree.

Top 5 Users that have the most diverse diets (Node in-degree):
Node: 322, In-degree: 35
Node: 379, In-degree: 33
Node: 456, In-degree: 32
Node: 492, In-degree: 32
Node: 556, In-degree: 32

Top 5 Venues with most unique users (Node out-degree):
Node: 39, Out-degree: 362
Node: 21, Out-degree: 350
Node: 9, Out-degree: 337
Node: 90, Out-degree: 337
Node: 100, Out-degree: 274
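Rankings like the two above can be obtained by counting how often each node appears as the target or the source of an edge. The following is a minimal sketch on made-up data; the helper `top_by_degree` is hypothetical and assumes repeat orders between the same venue and user count only once.

```python
from collections import Counter

def top_by_degree(edges, k=5):
    """Return the top-k users by in-degree and top-k venues by out-degree."""
    unique_edges = set(edges)  # repeat orders count once
    out_degree = Counter(venue for venue, _ in unique_edges)
    in_degree = Counter(user for _, user in unique_edges)
    return in_degree.most_common(k), out_degree.most_common(k)

# Made-up deliveries as (venue, user) pairs.
deliveries = [
    ("r1", "u1"), ("r1", "u2"), ("r1", "u3"),
    ("r2", "u1"), ("r3", "u1"), ("r1", "u1"),  # the last pair is a repeat order
]
top_users, top_venues = top_by_degree(deliveries, k=1)
```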

In order to understand the structure of deliveries a little better, we can plot the degree distribution of the network. The graph below shows the distribution of nodes by in-degree. We can see that on average users order from 8 different restaurants, while 50% of users order from up to 6 different restaurants (median).

Delivery network in-degree distribution and complementary cumulative degree distribution. Images by the Author.

The story is very different for restaurants. The histogram below shows that the majority of restaurants seem to deliver to only one user while some restaurants are delivering to up to 350 unique users.

Out-degree distribution. Image by the author.

We can also take a look at the network's in-degree centrality to determine where the users with the most diverse diets live. 🍔

In-Degree Centrality of the Wolt delivery network.

As we can see inhabitants of Kamppi, Kallio and Töölö neighbourhoods prefer some diversity in their diets while other areas are a little more conservative with their choices.

In fact, if we apply a community detection algorithm to our network the same 3 neighbourhoods can be identified as distinct communities.

This kind of separation does not come as a surprise when we consider the geographic characteristics of Helsinki. The void in the middle of the map is the area where railway tracks split Helsinki into two. Naturally, this divide leads to increased delivery distances and increased fees associated with the delivery. Hence users often prefer ordering from local restaurants. This keeps the fees low and the 🍕 warm!

Error & Attack Tolerance

Studies of the error and attack tolerance of networks are based on percolation theory. Percolation theory is a sub-field of statistical physics that describes the formation and behaviour of connected clusters in random systems. Although percolation theory is rooted in the study of lattices and random systems, its concepts are also useful in studying the connectivity of empirical networks such as the delivery network studied here. Error and attack tolerance help us understand the robustness of the network. Many complex systems display a surprising degree of tolerance against errors. For instance, relatively simple organisms grow and reproduce despite drastic environmental interventions. This is due to their error and attack tolerance, which is attributed to the robustness of the underlying metabolic network.

Errors: Nodes/edges are removed randomly from the graph until the network disintegrates.
Attack: Nodes/edges are removed based on their importance (degree, weight, centrality) until the size of the giant component of the network reaches 0.

The logic behind the naming is that errors are random and can happen to any element in the network, while attacks target the “important” elements of the network first.
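To make the two procedures concrete, here is a minimal sketch for node removal in plain Python. All names (`giant_component_size`, `removal_curve`, `degree_order`) and the toy network are assumptions for illustration; edge direction is ignored when measuring connectivity, and the attack order is computed once from the initial degrees rather than updated after every removal.

```python
import random
from collections import defaultdict, deque

def giant_component_size(nodes, edges):
    """Size of the largest connected component, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for other in neighbours[node] - seen:
                seen.add(other)
                queue.append(other)
        largest = max(largest, size)
    return largest

def removal_curve(edges, order):
    """Giant component size after removing nodes one by one in `order`."""
    remaining = {node for edge in edges for node in edge}
    sizes = [giant_component_size(remaining, edges)]
    for node in order:
        remaining.discard(node)
        sizes.append(giant_component_size(remaining, edges))
    return sizes

def degree_order(edges):
    """Nodes sorted from highest to lowest degree (the attack order)."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree, key=degree.get, reverse=True)

# Made-up network: one popular venue ("hub") serving ten users.
edges = [("hub", f"u{i}") for i in range(10)]

# Attack: remove the single highest-degree node.
attack = removal_curve(edges, degree_order(edges)[:1])

# Error: remove all nodes in a random order.
all_nodes = sorted({node for edge in edges for node in edge})
random.Random(0).shuffle(all_nodes)
error = removal_curve(edges, all_nodes)
```

In this toy case a single targeted removal shrinks the giant component from 11 nodes to 1, whereas a random removal is far more likely to hit one of the ten users than the hub.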

Removing Links

The graph below shows the error and attack tolerance of the delivery network when links between the nodes are removed. In a real-life scenario, the removal of links simulates what would happen if orders stopped being made.

Image by the author.

The red line illustrates how quickly the network would disintegrate if random links are removed, i.e. error tolerance. The green line illustrates the robustness of the network when the links connecting high-degree nodes are removed, i.e. attack tolerance. The blue line illustrates the removal of links between low-degree nodes.

The key takeaways are that if Wolt’s popularity declines and users stop ordering one by one, the delivery network will disintegrate first gradually and then suddenly. Furthermore, the removal of users with diverse diets will disintegrate the network slightly faster.

Removing Nodes

What if a new pandemic happens and government forces restaurants to shut down completely? How will this affect the delivery network? We can remove nodes from our network and see how quickly it will disintegrate.

Image by the author.

As we can see, removing high-degree nodes (popular restaurants) sharply decreases the size of the giant component of the network (green line). Removing only the top 10% of the high-degree nodes results in an 80% decrease in the size of the giant component. In other words, the delivery network will disintegrate sharply as the popular restaurants start to close down or leave the network. The same cannot be said about users or unpopular restaurants, where the relationship is close to linear.

Growing the Network

Suppose a new user joins Wolt’s customer base. Can we predict which restaurants the user will be ordering from?

We can assume that restaurants that are popular among the existing users have a higher probability of attracting the new user. This phenomenon is known as the Matthew effect or preferential attachment, commonly referred to as “the rich get richer”. Based on this, we can simulate the growth of the network by adding a single user at a time and connecting it to existing nodes with a certain probability.

In order to preserve the properties of the original network, we need two parameters in our growth simulation.

New nodes should attach to existing restaurants based on the restaurant’s out-degree. This way popular restaurants will have a higher chance of getting a new customer.
New nodes should have an in-degree between 1 and 35, drawn from the same probability distribution as in our current network. This way the behaviour of new consumers will mimic the behaviour of existing users.
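The two rules above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the simulation from the repository: the function `grow`, the `new_user_` naming scheme and the toy edge list are made up, and each new user is linked to distinct venues only.

```python
import random
from collections import Counter

def grow(edges, new_users, rng):
    """Add users one at a time, attaching them to venues preferentially."""
    edges = list(edges)
    out_degree = Counter(venue for venue, _ in edges)
    observed_in_degrees = list(Counter(user for _, user in edges).values())
    for i in range(new_users):
        user = f"new_user_{i}"
        # Rule 2: draw the number of venues from the observed in-degrees.
        k = min(rng.choice(observed_in_degrees), len(out_degree))
        venues = list(out_degree)
        weights = [out_degree[v] for v in venues]
        chosen = set()
        while len(chosen) < k:
            # Rule 1: pick venues with probability proportional to out-degree.
            chosen.add(rng.choices(venues, weights=weights)[0])
        for venue in sorted(chosen):
            edges.append((venue, user))
            out_degree[venue] += 1  # popular venues become even more popular
    return edges

# Made-up starting network of (venue, user) pairs.
edges = [("r1", "u1"), ("r1", "u2"), ("r1", "u3"), ("r2", "u1")]
grown = grow(edges, 50, random.Random(42))
```

Sampling directly from the list of observed in-degrees reproduces their empirical distribution without having to fit a parametric one.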

With this in mind, we can add 2000 new customers and see what happens to our network:

DiGraph with 5000 nodes and 18804 edges
Average Degree: 7.5216
Edge Density: 0.0007523104620924184
Average Clustering Coefficient: 0.03917784873040924
Transitivity: 0.00808486679523513

Top 5 Users that have the most diverse diets (in degree):
(322, 35)
(379, 33)
(456, 32)
(492, 32)
(556, 32)

Top 5 Venues with most unique users (out degree):
(21, 421)
(39, 413)
(90, 408)
(9, 393)
(100, 313)

As we can see from the basic network stats above, we now have 5000 nodes and 18804 edges. The average degree has decreased from 15.1 to 7.5. The average clustering coefficient has also decreased, from 0.094 to 0.039. This is most likely because the new users add many nodes but comparatively few edges, and users are never connected to one another, which dilutes the clustering in the network.

Further development

While our dataset is limited, data on restaurants and their popularity is available from OpenStreetMap. By using our knowledge of the existing network, we could extrapolate and grow our network accordingly. This would allow us to model the actual, larger delivery network that was trimmed from the data.

Conclusion

In this article, we analysed Helsinki’s food delivery network primarily based on the data from Wolt. We looked at some of the characteristics of the network, applied a clustering algorithm, analysed the Error and Attack Tolerance of the network and simulated its growth. If you are interested in continuing this exploration you can find the code behind the analysis here:

GitHub - Geometrein/helsinki-delivery-network


Time Series Forecasting with SARIMAX

Tigran Khachatryan — Sat, 12 Feb 2022 11:33:17 GMT

In this article we will implement a SARIMAX model in order to forecast the sales of a food delivery company. This is a short version of the analysis; the more in-depth version and the code can be found in this repo:

GitHub - Geometrein/sarimax: Quick implementation of ARIMA and SARIMAX models on dummy delivery orders data.

EDA

The underlying sales dataset contains two months of sales data generated in Helsinki, Finland. The dataset has the following variables:

See more details in code. Image by the author.

Let’s look at the available data and see what aspects are covered by the dataset.

Image by the author.

Observations

At a first glance, it seems that there are no immediately apparent relationships between the variables.
ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES is strongly correlated with ACTUAL_DELIVERY_MINUTES and ESTIMATED_DELIVERY_MINUTES, but that doesn’t give us new or useful information.
There are correlations between coordinates. Not a very useful relationship to explore either.

Since the pair-plot didn’t reveal any interesting patterns, it might be useful to look for patterns in places that didn’t show up in the pair-plot.

Time Series Analysis

Looking at the dataset as a time series will show if there are any temporal patterns. An interesting variable might be the number of orders over time, which can give insights into user ordering patterns.

Number of orders is an important variable because:

It coincides with the delivery company’s business interests in ways described in the introduction.
Causal relationships between time and the number of orders are understandable and interpretable. The same cannot be said about, for example, weather or user location.

Let’s plot a heat-map with days of the week, times of the day and number of orders. This will reveal the hourly ordering patterns in our dataset.

Image by the author.

Observations

The heat-map reveals a set of interesting patterns.

Each day seems to have two peaks in number of orders.
Hottest ordering times are slightly different for workdays and weekends.
During workdays, the number of orders peaks at 8:00 and 16:00, with a decrease in orders during lunchtime.
The weekends exhibit a similar behaviour but with a higher overall number of orders and with different peaks at 10:00–11:00 and 15:00–16:00.

Image by the author.

Observations:

Trend: There doesn’t seem to be a definite upward or downward trend. However, we remember from the daily plot that there is a trend.
Seasonality: There is a very strong daily seasonal pattern. And we remember from the daily plot that there is also a weekly seasonality.
Residuals: No observable patterns left in the residuals.

The strong daily seasonality in the series is a pattern worth exploring because it also coincides with business interests defined in the introduction.

Let’s analyse the seasonal pattern in more detail.

Stationary Check

Before we try to apply any models, let’s check if the time series is stationary. Stationarity comes in many flavours, but here we will use the following definition: a time series is stationary if a shift in time doesn’t cause a change in the shape of its distribution. As a result, the mean and standard deviation are not time dependent.

A fluctuating rolling mean and standard deviation can be a first indication of a non-stationary time series.
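A rolling statistic is simply the same statistic recomputed over a sliding window. The sketch below uses made-up order counts and a hypothetical helper `rolling`; the window length of 4 is arbitrary and chosen only to keep the example small.

```python
import statistics

def rolling(values, window, func):
    """Apply `func` to every full window of `window` consecutive values."""
    return [func(values[i - window:i]) for i in range(window, len(values) + 1)]

# Made-up hourly order counts with a level shift halfway through.
orders = [5, 6, 5, 6, 5, 6, 15, 16, 15, 16, 15, 16]
rolling_mean = rolling(orders, 4, statistics.mean)
rolling_std = rolling(orders, 4, statistics.stdev)
```

Here the rolling mean moves from 5.5 in the first window to 15.5 in the last one, which is the kind of drift that hints at non-stationarity.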

Image by the author.

Observations

We can see that the mean and the variance of time series are not constant over time.
Mean and the variance seem to follow weekly seasons.

Judging from the plot, the series do not look stationary. With this in mind, let’s perform two statistical tests to discover if the series have a unit root or if they are trend-stationary.

Augmented Dickey Fuller Test
---------------------------------------------
ADF Statistic: -3.464712
p-value: 0.008941
Number of lags used: 24
Number of observations used: 1430
T values corresponding to adfuller test:
1% -3.434931172941245
5% -2.8635632730206857
10% -2.567847177857108

Kwiatkowski-Phillips-Schmidt-Shin test
---------------------------------------------
KPSS Statistic: 0.396855
p-value: 0.078511
Number of lags used: 20
Critical values of KPSS test:
10% 0.347
5% 0.463
2.5% 0.574
1% 0.739

ADF & KPSS Test Results

Since the ADF statistic -3.46 < -3.43 and the p-value 0.0089 < 0.05, we can reject the null hypothesis H0 in favour of the alternative Ha.
Since the KPSS statistic 0.396 < 0.463 and the p-value 0.078 > 0.05, we fail to reject the null hypothesis H0.

Based on these results we can conclude that:

According to ADF test our series have no unit root
According to KPSS test our series are trend-stationary
This confirms our observation from the graph above where rolling mean and std are following a weekly trend.
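The decision rules used above can be written out explicitly. The sketch below copies the statistics and critical values from the printed test output (critical values rounded to six decimals); the variable names are made up for the example. Note that the two tests have opposite null hypotheses, so the comparisons point in opposite directions.

```python
# Values copied from the test output above.
adf_stat = -3.464712
adf_critical = {"1%": -3.434931, "5%": -2.863563, "10%": -2.567847}

kpss_stat = 0.396855
kpss_critical = {"10%": 0.347, "5%": 0.463, "2.5%": 0.574, "1%": 0.739}

# ADF: H0 = unit root; reject when the statistic is below the critical value.
adf_rejects = {level: adf_stat < value for level, value in adf_critical.items()}

# KPSS: H0 = stationarity; reject when the statistic is above the critical value.
kpss_rejects = {level: kpss_stat > value for level, value in kpss_critical.items()}
```

With these numbers the ADF test rejects a unit root at every listed level, while the KPSS result depends on the chosen level: the null hypothesis is retained at 5% but would be rejected at 10%.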

Modelling

Now that we have established that the series are trend-stationary, we can start modelling. Considering the relatively small sample size, the fact that the dataset captures a seasonal time series and the number of variables under examination, SARIMAX seems like an adequate model for the task (see more details in code). If a simple model can predict the target variable well, its predictions will depend on fewer variables and it will therefore generalise better. For now, we will not look for more complicated models that might be a better fit.

In order to choose the right SARIMAX hyper-parameters let’s plot the Autocorrelation and partial autocorrelation function plots.

ACF & PACF

ACF and PACF will help us to identify the lags that have high correlations.

Image by the author.

Observations:

ACF

As we already knew our series are seasonal and our ACF plot confirms this pattern. If we plot more lags we will also observe that significance of the lags is gradually declining.
The first significant lag is lag 1, which is not surprising. The number of orders rises and falls gradually from hour to hour. Hence the orders during the previous hour might tell us something about the orders during the current hour.
The next important lags are lags 12 and 24. These are deterministic seasonal patterns connected with day/night cycles. The 12-hour lag is negatively correlated because when the number of orders starts to increase at 8:00, the number of orders at 20:00 is already decreasing. However, the 24-hour lag shows that the number of orders made today at 16:00 might hint at the number of orders to be made tomorrow at 16:00.

PACF

With PACF we can see that lag 1 and 24 have the highest correlation. This means that seasons 24 hours apart are directly correlated regardless of what is happening in between.
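The sign pattern at lags 12 and 24 is what any clean 24-hour cycle produces. The following sketch computes the sample autocorrelation on a made-up sinusoidal series; the helper `autocorrelation` is hypothetical and uses the usual biased estimator that divides by the full-sample sum of squares.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((value - mean) ** 2 for value in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

# Made-up hourly series: a pure 24-hour cycle over ten days.
orders = [10 + 5 * math.sin(2 * math.pi * hour / 24) for hour in range(240)]

acf = {lag: autocorrelation(orders, lag) for lag in (1, 12, 24)}
```

For this toy series the autocorrelation is strongly positive at lags 1 and 24 and strongly negative at lag 12, mirroring the half-day offset described above.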

We can take a look at the lags of interest in more detail:

Image by the author.

Observations:

With lags 1, 12 and 24 we confirm the correlations shown in ACF plot. Positive linear correlations can be seen in lag 1 and lag 24 and a negative non-linear correlation in lag 12.

(see more details in code)

Armed with this information, let’s start forging the model!

SARIMAX

The SARIMAX model is specified as:

SARIMAX(p,d,q)×(P,D,Q)s

Where:

Trend Elements are:

p: Autoregressive order
d: Difference order
q: Moving average order

Seasonal Elements are:

P: Seasonal autoregressive order.
D: Seasonal difference order. D=1 would calculate a first order seasonal difference
Q: Seasonal moving average order. Q=1 would use first-order seasonal errors in the model.
s: Length of a single seasonal period.
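To make the two differencing orders tangible, the sketch below applies a seasonal difference (D = 1) followed by a regular difference (d = 1) to a made-up series. The helper `difference` is hypothetical, and a season length of 4 is used instead of 24 purely to keep the example short.

```python
def difference(series, lag=1):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Made-up series: a repeating 4-step season on top of a steady upward trend.
season = [0, 5, 2, 7]
series = [season[t % 4] + t for t in range(12)]

seasonal_diff = difference(series, lag=4)      # D = 1 with s = 4
both_diffs = difference(seasonal_diff, lag=1)  # d = 1 applied afterwards
```

The seasonal difference removes the repeating pattern and leaves a constant, and the regular difference then removes what remains of the trend.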

Exogenous variables

X: We create an exogenous variable that simulates the rising number of orders throughout the week. To do this accurately, we can measure the average number of orders by weekday and create a feature that represents the day of the week. (see more details in code)
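One way such a weekday feature could be built is to replace every timestamp with the mean order count of its weekday. The sketch below is an assumption about the construction, not the code from the repository; the helper `weekday_feature` and the data are made up.

```python
from collections import defaultdict
from datetime import datetime

def weekday_feature(timestamps, orders):
    """Map each timestamp to the mean order count of its weekday."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ts, value in zip(timestamps, orders):
        totals[ts.weekday()] += value
        counts[ts.weekday()] += 1
    means = {day: totals[day] / counts[day] for day in totals}
    return [means[ts.weekday()] for ts in timestamps]

# Made-up hourly observations: two on a Monday, two on a Saturday.
timestamps = [
    datetime(2020, 8, 3, 8), datetime(2020, 8, 3, 16),
    datetime(2020, 8, 8, 10), datetime(2020, 8, 8, 15),
]
orders = [10, 20, 30, 50]
weekday_exog = weekday_feature(timestamps, orders)
```

In a real forecast the weekday means would have to be estimated on the training period only, so that no information from the test period leaks into the feature.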

Parameter Estimation

We will use the Box–Jenkins method for identifying the parameters for SARIMA. There are of course more modern approaches like grid search and auto.arima(), but the reason we use Box–Jenkins is that it forces you to understand your model as opposed to brute-forcing your way through the parameters. First we will try to estimate the parameters based on our theoretical understanding of the ACF and PACF plots to ensure that we understand our model. Next, we will cross-check our values with the results of a grid search.

Theoretical estimates:

s: In our ACF plot there is one peak and one valley every 24 hours. Thus, we can set the seasonal period to s = 24. This is also backed by our subject matter knowledge.
p: We are dealing with a gradual change where y_{t−1} is not drastically different from y_t, hence the trend autoregressive order will be set to p = 1. This is also confirmed in the ACF plot, where y_{t−1} is the first significant lag.
d: Since we established that our series are trend-stationary, we will set trend differencing to d = 1.
q: Based on our PACF correlations we can set q = 2, since it is the most significant lag.
P: P = 2 will allow us to use the first and second seasonal offsets (24) in the model, e.g. t−(s×1) = t−(24×1) and t−(s×2) = t−(24×2).
D: Since we are dealing with seasonality, we can use first-degree seasonal differencing, D = 1.
Q: The seasonal moving average order will be set to Q = 2. In this case the model will take into account the moving average of lags t−(24×1) and t−(24×2); as shown in our PACF graph, these lags have a significant correlation.

SARIMAX(1,1,2)×(2,1,2)24

Train

Now let’s train the model! 🚀

SARIMAX Results                                      
==========================================================================================
Dep. Variable:                             ORDERS   No. Observations:                 1075
Model:             SARIMAX(1, 1, 2)x(2, 1, 2, 24)   Log Likelihood               -3118.452
Date:                            Mon, 07 Feb 2022   AIC                           6254.904
Time:                                    15:31:00   BIC                           6299.513
Sample:                                08-01-2020   HQIC                          6271.818
                                     - 09-15-2020                                         
Covariance Type:                              opg                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
weekday_exog     0.8400      0.138      6.073      0.000       0.569       1.111
ar.L1            0.4937      0.063      7.783      0.000       0.369       0.618
ma.L1           -1.1574      0.079    -14.569      0.000      -1.313      -1.002
ma.L2            0.1579      0.070      2.266      0.023       0.021       0.295
ar.S.L24         0.8870      0.052     17.011      0.000       0.785       0.989
ar.S.L48        -0.2624      0.024    -10.926      0.000      -0.309      -0.215
ma.S.L24        -1.7488      0.052    -33.725      0.000      -1.850      -1.647
ma.S.L48         0.7709      0.051     15.025      0.000       0.670       0.871
sigma2          20.6057      0.886     23.252      0.000      18.869      22.343
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):               339.23
Prob(Q):                              0.98   Prob(JB):                         0.00
Heteroskedasticity (H):               1.34   Skew:                             0.54
Prob(H) (two-sided):                  0.01   Kurtosis:                         5.57
===================================================================================

Image by the author.

Observations:

The standardized residual plot: The residuals over time don’t display any obvious patterns. They appear as white noise.
The Normal Q-Q plot: Shows that the ordered distribution of residuals follows the linear trend. However, the curvature at the ends of the plot suggests heavier tails in our distribution.
Histogram and estimated density plot: The KDE roughly follows the N(0,1) distribution; however, it shows positive excess kurtosis (the summary reports a kurtosis of 5.57, against 3 for a normal distribution).
The Correlogram plot: Shows that the time series residuals have low correlation with lagged versions of themselves. Thus, there are no significant patterns left to extract from the residuals.

Evaluate

Image by the author.

Akaike information criterion | AIC: 6254.90413501809
Bayesian information criterion | BIC: 6299.513044006454
Mean Squared Error | MSE: 22.106947769139186
Sum Squared Error | SSE: 23764.968851824626
Root Mean Squared Error | RMSE: 5.132799959869062
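The two information criteria can be cross-checked against the log-likelihood printed in the model summary. The sketch below is a plain-Python check, not the library's own code. It assumes that the 9 rows of the coefficient table are the estimated parameters and that the BIC penalty uses the sample size left after differencing (1075 minus 1 + 24 observations). That second assumption is inferred from the numbers rather than stated in the article.

```python
import math

log_likelihood = -3118.452   # "Log Likelihood" in the summary
n_params = 9                 # rows in the coefficient table, including sigma2
n_obs = 1075                 # "No. Observations"

# Assumption: observations lost to differencing (d + D*s = 1 + 1*24)
# are excluded from the sample size used in the BIC penalty.
n_effective = n_obs - (1 + 1 * 24)

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_effective) - 2 * log_likelihood

print(round(aic, 3), round(bic, 3))
```

Both values agree with the reported AIC and BIC up to rounding of the log-likelihood. BIC penalises every extra parameter more heavily than AIC, which is why it is the larger of the two here.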

The key metric to pay attention to here is:

Root Mean Squared Error | RMSE: 5.132799959869062

The Root Mean Squared Error shows us that our model's predictions are typically off by about 5 orders. Depending on our goals, we could optimise the model for a smaller error; however, for the purposes of this article this is an acceptable result. We can proceed to the forecasting.
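For readers who want to see how these error metrics relate to each other, here is a minimal sketch in plain Python. The order counts are made-up toy values, not the article's data, and `error_metrics` is a hypothetical helper name.

```python
import math


def error_metrics(actual, predicted):
    """Return (SSE, MSE, RMSE, MAE) for two equally long sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    mse = sse / len(errors)
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / len(errors)
    return sse, mse, rmse, mae


# Toy hourly order counts, for illustration only.
actual = [12, 30, 8, 41]
predicted = [10, 33, 8, 36]

sse, mse, rmse, mae = error_metrics(actual, predicted)
print(sse, mse, round(rmse, 3), mae)
```

Because the errors are squared before averaging, the RMSE is never smaller than the mean absolute error, and a few large misses (such as under-predicted demand spikes) pull it up noticeably.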

Forecast

Image by the author.

Conclusion

In this notebook we explored the provided dataset and implemented a SARIMAX model. The model captured the two daily spikes, but not when these spikes crossed the mark of 40 orders. It accurately replicated the gradual rise of orders throughout the week. However, the model predicts orders during the night hours that are very unlikely to occur. It seems that the model would benefit from another exogenous variable that would apply hourly weights for each hour of the day. Overall, the model reached an RMSE of about 5 orders.

Further development

There are multiple directions we can move in from here. If we stick with the current SARIMAX model we could:

Account for public holidays and other anomalies.
Incorporate monthly and seasonal (summer/winter) trends.
Spatialise the data and apply the models at a more granular level, e.g. neighbourhood, restaurant, user category.

Naturally, all of these will require more data and features. However, any new data and features should have an interpretable causal relationship with the target variable. We had more data in our sample dataset, but it did not seem to be directly applicable because establishing the causal relationships between these variables would not have been possible.

Alternatively, we can try to apply different models. One candidate could be exponential smoothing, although it will probably struggle with a multi-seasonal dataset like this. If we had more data available we could apply Prophet, XGBoost or even dive into the rabbit hole of neural networks. But these topics are a discussion for another time :)

About the Author

Code behind the analysis:

GitHub - Geometrein/sarimax: Quick implementation of ARIMA and SARIMAX models on dummy delivery orders data.

Cities, Maps, Dashboards

Tigran Khachatryan — Mon, 27 Sep 2021 12:32:13 GMT

Humans have been recording and analysing data for centuries. Writing, for instance, was developed in ancient Mesopotamia around 3100 BC because bureaucrats needed an efficient tool to record and track citizen information. Since the Babylonian Empire, governments have held censuses to gather huge datasets on their citizenry, livestock and resources for taxation purposes. Putting available resources on record enabled more efficient management and made the growth of states larger than the old city-states possible. In turn, larger states started to envelop larger territories and more diverse resources, leading to even larger census data. This positive feedback loop went largely uninterrupted for centuries, planting the seed for what today we recognise as statistics.

The word statistics ultimately derives from the modern Latin term statisticum collegium (“council of state”) and the Italian word statista (“statesman” or “politician”). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state, signifying the “science of state” (then called political arithmetic in English).

While the science of statistics has changed since the 18th century, the abstract constructions we use to store records have remained relatively unchanged. One of the well-known forms of data representation is the tabular structure invented by the Babylonians and still widely used today. In tabular representations data is usually organised into a grid structure that shows the relationships between rows and columns of entries. This structure is incredibly efficient in relating numerical entries to each other, but for the ancient “science of the state” this enumeration was merely a means to an end. There was still a need to relate the numerical relationships to the experiential field, i.e. three-dimensional space. However, at the time the predominant methods of representing spatial relations were vastly different from the methods for representing numerical relations. Spatial relationships were defined with local maps (including architectural drawings) that utilised the knowledge of geometry at the time. As a result of this difference in the forms of representation, census data was related to spatial maps through encoding and naming conventions that linked the label of a territory with the data associated with it, while the boundaries of the spaces were defined separately on maps and drawings using geometry. This practice of isolating spatial and non-spatial information survived the test of time, largely thanks to its simplicity, and today it can be found at the foundation of a great number of modern technologies.

Top Left: Numerical relations Top Right: Spatial relations: Giambattista Nolli, fragment from the Map of Rome (1748). Images by the author

Although both tables and maps have evolved, the underlying principles behind them have altered only marginally. What has changed dramatically is the way humans interact with information. Already in the 18th century, it was evident that while tabular structures were efficient for storing information, they were not ideal for viewing it. Mathematician, political economist and architect William Playfair, when he was not busy with collapsing the French national currency, came to the same realisation and developed methods of summarising and displaying data. He is often credited as the father of infographics, and his diagrams (bar chart, pie chart) can be found in nearly every corner of the internet, where the speed of delivery has become a crucial requirement.

His diagrams allowed viewers to gain insights into the underlying data at a glance. They summarised the main trends and delivered information faster in a way accessible to many.

Dashmap

With these questions and considerations in mind, I developed an open-source web platform that gathers, analyses and visualises urban data in the Helsinki metropolitan area. It utilises the interactivity of web technologies to relate spatial and tabular data, while presenting insights in the form of accessible dashboards.

Dashmap Layout Concept Sketch. Image by the author

The platform attempts to address a fundamental problem present in modern open data ecosystems.

Problem: Modern cities have rich open data ecosystems; however, a lack of technical skills often prevents people from using the available data.

An average citizen creates 160 gigabytes of data per day. Modern cities and corporations have become very efficient at capturing and analysing this data. For a variety of technical reasons, the data produced by citizens is often available in complex file formats and is accessed through convoluted API requests. This is great for machines, but looking at this data will often make your eyes bleed. DashMap.io empowers non-tech users by providing access to open data through interactive and accessible dashboards.

Image by the author.

The layout of Dashmap.io features a dashboard juxtaposed with an interactive map. The map can be used to filter data by postal areas, and the dashboards are updated based on the selections made on the map.

Dashmap.io. Image by the author.

The website organises data into 5 main categories.

Census

A census is a complete enumeration of a population and its vital characteristics. Censuses are created by systematic recording and aggregation of data about the members of a given population. The purpose of a population census is to understand the basic structure of the society and identify emerging patterns and trends.

Real Estate

Real estate is a type of real property consisting of land along with any permanent improvements attached to the land. The included resources can be man-made or natural. They can include water, trees, crops, minerals and built structures. Often the value of real estate is one of the key indicators of an economy’s health.

Services

A service is a transaction in which no physical goods are transferred from the seller to the buyer. Services may be defined as acts or performances whereby the service provider provides value to the customer using resources, skill, ingenuity or experience.

Mobility

Geographic mobility is the measure of how populations and goods move over time. Population mobility has a large impact on many sociological factors in a society and has implications ranging from impacts on local economic growth to housing markets and demand for regional services.

Environment

The environment refers to the environmental conditions created as a byproduct of man-made and natural processes. It includes metrics like pollution, noise, wind patterns and radiation levels.

Each tab contains multiple sections providing insights based on more granular data.

Open source

As mentioned before, the purpose of DashMap is to empower individuals and businesses to utilise open data in their decision-making process. Therefore it is crucial that the platform is open source. The whole codebase behind Dashmap can be found here:

GitHub - Geometrein/dashmap.io: DashMap is an open source web platform that gathers, analyses and visualises urban data.

Contributions and pull requests are welcome!

Conclusion

Open data ecosystems will continue to grow in the future. However, open data should benefit everyone, not just companies who monetise data platforms. Therefore more platforms are needed that can empower individuals and businesses to utilise open data without paying the huge upfront costs associated with app development. Dashmap provides this platform for the Helsinki metropolitan area. If you would like to make one for your city, feel free to use Dashmap as a launching platform.

About the Author

Cities, Maps, Dashboards was originally published in TDS Archive on Medium.

Helsinki City Bikes: Network Analysis

Tigran Khachatryan — Mon, 17 May 2021 12:02:32 GMT

Part II

The first part of this article looked at the Helsinki city bike system through the lens of descriptive statistics. In this article, the city bike system will be analyzed as a complex network.

Introduction

For the past few decades, transportation systems have commonly been analyzed as networks. This abstraction reduces the amount of information to consider and puts the focus primarily on the internal structure of the underlying systems.

LEFT: Geographic Map of Königsberg. RIGHT: Camillo Sitte, The Birth of Modern City Planning (1986)

This idea is certainly not novel in mathematics or urban planning. However, the recent advances in computational technologies and GIS frameworks enabled new pragmatic applications of graph-theoretical ideas in the field of geographic mobility.

Within the context of transportation, the term network refers to the framework of routes within a system of locations, identified as nodes. A route is a single link between two nodes that are part of a larger network; it can refer to tangible routes such as roads and rails, or less tangible routes such as air and sea corridors.

Helsinki City Bike Network 2020. Image by the author.

Transportation networks normally denote either permanent physical tracks such as roads and railroads or scheduled services. Unlike collective transportation (scheduled buses and trains), individual transportation (walking, bike-sharing) occurs on demand and is more spatially flexible. Thanks to this, systems of shared mobility have a huge potential for self-organization. It is crucial to highlight that the edges of the city bike network are not predetermined; they are generated by the users. Thus, the structure of the network emerges from the flow of people through the city. Recurring trips made by the residents gradually form usage patterns and create the structure of the bike-sharing network. This information, in turn, can be used in a feedback loop for the evolution of the network. Rising or decreasing bike demand can provide insights into wider sociological, economic and geographic factors in certain areas. The graph below illustrates bike usage in different areas of Helsinki during 2020.

Departure stations by usage in 2020. The interactive version can be found here. Image by the author.

As expected, central Helsinki is the most active region in terms of geographic mobility. However, as pointed out in the first part of this article, the geographic centre of Helsinki and the centre of the Helsinki bike network are not the same.

Where is the centre of the Helsinki city bike network?

Network Centrality Measures

Complex networks by definition are highly heterogeneous structures. This often results in some parts of the network being more information-rich than others. For example, in social networks, some individuals might have a large number of connections and can spread information faster than others. Hence, within the context of social network analysis, the nodes that represent them are considered more important (central). Within the context of transportation networks, an urban region where the influx of people is higher than in the other areas can be considered central. However, the centrality of a given network may change over time as a result of the growth and evolution of the underlying system. Thus, the definition of centrality is not absolute but rather depends on the specific context and the purpose of the abstraction. Taking this relativity into account, several centrality measures have been proposed that focus on different types of relationships between the nodes.

Degree Centrality

CD(i) degree centrality for node i

Degree centrality is the simplest centrality measure to understand and compute. The “degree” of a node refers to the number of nodes that a given node is connected to. Within the context of city bikes, this refers to the number of bike stations that users have travelled to from the target station. As we can see in the graph below, stations located in central Helsinki have a higher number of connections. Closer to the periphery of the network the degree of the nodes gradually decreases.

Image by the author.
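As a minimal sketch of how such a degree could be counted from raw trip records, the snippet below collects the distinct destinations reached from each station. The trips are made up for illustration (only the station names come from the article), and treating the degree as the number of distinct destinations is an assumption based on the description above, not the notebook's actual code.

```python
# Made-up trips as (departure station, return station) pairs.
trips = [
    ("Pasilan asema", "Töölöntulli"),
    ("Pasilan asema", "Paciuksenkatu"),
    ("Pasilan asema", "Töölöntulli"),    # repeated trip, same connection
    ("Töölöntulli", "Pasilan asema"),
    ("Paciuksenkatu", "Esterinportti"),
]

destinations = {}
for departure, arrival in trips:
    destinations.setdefault(departure, set())
    destinations.setdefault(arrival, set())   # keep stations with no departures
    if departure != arrival:                  # ignore round trips
        destinations[departure].add(arrival)

# Degree = number of distinct stations reached from each station.
degree = {station: len(reached) for station, reached in destinations.items()}
average_degree = sum(degree.values()) / len(degree)

for station, value in sorted(degree.items(), key=lambda item: -item[1]):
    print(f"{station}: {value} connections")
print("average degree:", average_degree)
```

Repeated trips between the same pair of stations count only once here; how often a connection is used would be captured by an edge weight instead.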

The average degree of a node in the Helsinki city bike network is 162. This means that a bike station on average is connected to 162 different stations. The stations with the highest degree centrality are:

Haukilahdenkatu:    312 connections
Paciuksenkaari:     272 connections
Huopalahdentie:     267 connections 
Laajalahden aukio:  262 connections
Munkkiniemen aukio: 262 connections
Töölöntulli:        260 connections
Tilkanvierto:       259 connections
Paciuksenkatu:      258 connections
Pasilan asema:      258 connections 
Esterinportti:      256 connections

The fact that Haukilahdenkatu station is at the top of this list suggests that a large number of citizens from different districts make regular trips to and from this station.

Image by the author.

This could potentially mean that this station is either next to a large transportation hub or an important point of interest. A quick Google search reveals that it is located across the street from a large educational institution and a work rehabilitation centre. This explains both the regularity of the trips and the inflow of people from different suburban areas.

Betweenness Centrality

Where gjk is the number of geodesic paths linking units j and k, and gjk(i) is the number of those geodesics on which unit i occupies an intermediary location.

The betweenness centrality captures how much a given node is “in-between” others. This measure calculates the shortest paths between all the nodes of the network and assigns each node a measure based on the number of shortest paths going through the target node. The graph below illustrates the city bike stations by betweenness centrality.

Image by the author.

The node with the highest betweenness centrality represents the Lehtisaarentie station. This is a clear example of geographic constraints affecting the structure of spatial networks. The station is located on the Lehtisaari island that links the Aalto University campus to Munkkiniemi and Helsinki. Due to the geography of the Helsinki archipelago, traffic flows from one municipality to another are mostly channelled through the island-bridge systems that lie in between. This makes both the Lehtisaari and Lauttasaari islands crucial traffic nodes connecting the municipality of Espoo to Helsinki.
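The bridge effect can be reproduced on a toy graph. The sketch below computes betweenness as the sum of g_jk(i) / g_jk over all pairs of nodes, using Brandes' algorithm for unweighted, undirected graphs. The graph (two triangles joined through a single "bridge" node) and all node names are made up for illustration; the article's own analysis may well have used a graph library instead.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalised betweenness for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search, counting shortest paths (sigma).
        order = []
        predecessors = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted twice (once from each endpoint).
    return {v: value / 2 for v, value in score.items()}


# Two "mainland" triangles connected only through one bridge node.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "bridge"],
    "bridge": ["c", "d"],
    "d": ["bridge", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

centrality = betweenness_centrality(graph)
print(max(centrality, key=centrality.get), centrality["bridge"])
```

Every shortest path between the two triangles has to pass through the bridge node, so it collects all 3 × 3 = 9 cross pairs, even though it has only two connections.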

Closeness Centrality

Closeness centrality is a way of detecting nodes that are able to distribute flows efficiently through the network. The closeness centrality of a node is calculated as the reciprocal of its average geodesic distance to all other nodes, so nodes that are only a few steps away from everything else score highest. Within the context of the city bikes, a high closeness centrality indicates stations that can potentially act as versatile intermediary stations within the network.

Image by the author.
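A minimal sketch of the closeness computation, assuming an unweighted, connected graph and the common normalisation (n − 1) divided by the sum of shortest-path distances. The toy graph and its node names are made up for illustration.

```python
from collections import deque


def closeness_centrality(adj):
    """Closeness = (n - 1) / sum of shortest-path distances to all other nodes."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        # Nodes that cannot reach the whole graph get 0 in this sketch.
        result[source] = (n - 1) / total if len(dist) == n and total > 0 else 0.0
    return result


# A hub with three neighbours, one of which leads on to a peripheral node.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": ["c"],
}

closeness = closeness_centrality(graph)
for node, value in sorted(closeness.items(), key=lambda item: -item[1]):
    print(node, round(value, 3))
```

The hub reaches every other node in at most two steps and therefore gets the highest score, while the peripheral node "d" gets the lowest.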

As illustrated before, a large number of stations are directly connected to Haukilahdenkatu station. This leads to Haukilahdenkatu station having the highest closeness centrality within the network. Other nodes with high closeness centrality are located in the areas of Töölö and Pasila. Closeness centrality might be an interesting metric for network optimization, especially for larger networks. In larger bike-sharing networks users might be motivated to make stops at the intermediary bike stations to avoid extra charges. However, as we have seen in the previous part of this article, long trips are not common in Helsinki.

Eigenvector Centrality & PageRank

Eigenvector centrality measures a node’s importance within the network while giving consideration to the importance of its neighbours. Connections to nodes with a high eigenvector centrality contribute more to the score than equal connections to low-scoring nodes. In other words, a node with a large number of connections may have a low eigenvector score if all of its connections are with low-score nodes. A popular variation of eigenvector centrality is Google’s PageRank. In essence, PageRank is a variation of normalized eigenvector centrality combined with random jumps. Thus it is not surprising that PageRank yields results comparable to eigenvector centrality (graphs below). With this method, stations like the Central Railway Station, Pasila, Lehtisaarentie and Haukilahdenkatu appear as central nodes within the network.

Image by the author.
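The "eigenvector centrality plus random jumps" idea can be sketched with a few lines of power iteration. The snippet below is a simplified, unweighted PageRank. The damping factor of 0.85 is the commonly used default rather than a value from the article, and the tiny directed trip graph is made up for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph {node: [successors]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random jump: every node receives a small uniform share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


# Made-up directed trips: three stations feed a hub, the hub feeds "a".
trip_graph = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

ranks = pagerank(trip_graph)
for node, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(value, 3))
```

Station "a" has a single incoming connection, but because that connection comes from the hub it ends up ranked far above "b" and "c". This is the "importance of neighbours" effect described above.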

The application of eigenvector centrality and PageRank in an urban context is a subject of ongoing debate. It has been suggested that these metrics, applied in an urban context, often lead to misleading results since they ignore a multitude of urban qualities that might contribute to the importance of the node within the city. While these limitations might be a problem in collective transportation systems where the routes of the services are predetermined, this is certainly not the case with individual transportation. As mentioned before, the structure of individual transportation networks is generated by the users; therefore it already has some information about the urban qualities embedded in it. The fact that users have a choice and choose to take one path instead of another creates a trivial voting mechanism that rates different areas of the city. Thus, in this kind of network, the study of the structure of the network is also a study of user behaviour and the urban qualities users prefer. Additionally, eigenvector centrality in individual transportation networks highlights not only important singular nodes but also geographically important areas within the city. This is possible because it acknowledges not only the importance of individual stations but also the importance of the stations adjacent to them. Eigenvector centrality illustrates that the importance of the city bike stations (nodes) is derived not so much from their inherent attributes (location, station size) but rather from their relationships with other nodes within the network. Central stations together with the nodes adjacent to them form small enclaves of activity. The identification of this kind of grouping of nodes within the context of complex networks is commonly referred to as community detection.

Community Detection

Communities in networks refer to groups of nodes that are densely connected internally. Community detection is often a crucial process for understanding the structure of complex networks. In the case of the city bike network, community detection can help to better understand bike usage patterns and determine adequate pricing models. There are a number of algorithms available for community detection.

Combo Method

The majority of existing community detection strategies iteratively perform one of the following operations: merging two small communities (agglomerative), splitting a large community into two (divisive), or moving nodes between two distinct communities. The Combo algorithm, developed at the MIT Senseable City Lab, compounds all three strategies into a sequence. After creating an initial single community, for each source community, the best possible redistribution of all source nodes is calculated, and then the best grouping is applied. These steps are iteratively repeated until the given fitness criteria are achieved. The graph below illustrates the Combo method applied to the Helsinki city bike network.

Combo method for community detection applied to the Helsinki City Bike Network. Image by the author.

This grouping reveals two distinct communities that are separated approximately in line with the existing municipality divisions. This is perhaps not extremely informative, since this division could have been assumed merely by looking at the map. Based on the first part of this article, intuition suggests that bikes are more likely to be used inside the municipalities over shorter distances. Therefore, if other communities exist they should be found on a more local scale.

Louvain Method

Since the previous community detection method did not yield insightful results, we can apply a different method that specifically focuses on the modularity of the network. The Louvain community detection method is a good candidate due to its simplicity and computational efficiency. Finding the partition with maximum modularity is an NP-hard problem, so the Louvain method relies on a greedy heuristic that quickly finds good, though not necessarily optimal, partitions. Applying the Louvain method to the Helsinki city bike network reveals four distinct communities.

Louvain method for community detection applied to the Helsinki City Bike Network. Image by the author.

This means that bike usage inside these four communities is higher than the bike usage between these communities. Nodes with high betweenness centrality are usually the places where the communities might be divided into modules. If we return to our betweenness centrality graph, this relationship becomes even more evident. Boundaries of the communities pass roughly through the nodes with high betweenness centrality. Also, this division doesn’t come as a surprise taking into account the geographic character of the Helsinki archipelago. Landmasses and their divisions by large bodies of water significantly affect community formation. It is important to highlight that all transportation networks are spatial networks and their structure and evolution are closely intertwined with physical constraints. An interesting discovery is that the Leppävaara area, despite being part of Espoo municipality, is more densely connected to Munkkiniemi and Pitäjänmäki than to the Otaniemi or Tapiola areas. A similar tendency can be seen in some stations around Sörnäinen, where stations are more densely connected to the Herttoniemi area.
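The modularity score that the Louvain method tries to maximise can be computed directly for any given partition. The sketch below uses the standard definition for an unweighted, undirected graph: for each community, take the fraction of edges inside it minus the fraction expected from its total degree. It scores partitions only and does not implement the Louvain optimisation itself. The toy graph and community labels are made up for illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single edge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(round(q_split, 4), round(q_single, 4))
```

Splitting the graph along the single connecting edge gives a clearly positive modularity, while lumping everything into one community gives zero. This matches the observation that community boundaries tend to run through the few links that connect otherwise dense groups.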

Fluid Communities Method

The Fluid Communities algorithm is a community detection algorithm based on the idea of introducing a number of fluids within a non-homogeneous environment, where fluids will expand and push each other influenced by the topology of the environment until a stable state is reached. The algorithm does the following. First, it defines each of the starting communities in a random node of the graph. Then the algorithm iterates over all nodes in a random order, updating the community of each node based on its own community and the communities of its neighbours. This process is performed iteratively until convergence.
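The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the reference implementation. It assumes a connected, unweighted graph and gives each fluid a density of 1 divided by the size of its community. The function name, the `max_iter` cap and the fixed `seed` are choices made for this example.

```python
import random


def fluid_communities(adj, k, max_iter=100, seed=0):
    """Simplified Fluid Communities sketch for a connected, unweighted graph."""
    rng = random.Random(seed)
    nodes = list(adj)
    community = {v: None for v in nodes}
    size = {}
    # Each of the k fluids starts in a distinct random node.
    for label, v in enumerate(rng.sample(nodes, k)):
        community[v] = label
        size[label] = 1
    for _ in range(max_iter):
        changed = False
        order = nodes[:]
        rng.shuffle(order)
        for v in order:
            # Sum the densities of the fluids present on v and its neighbours.
            weight = {}
            for u in [v, *adj[v]]:
                label = community[u]
                if label is not None:
                    weight[label] = weight.get(label, 0.0) + 1.0 / size[label]
            if not weight:
                continue  # no fluid has reached this neighbourhood yet
            best = max(weight.values())
            candidates = [c for c, w in weight.items() if w >= best - 1e-9]
            if community[v] in candidates:
                continue  # keep the current community on ties
            new_label = rng.choice(candidates)
            if community[v] is not None:
                size[community[v]] -= 1
            community[v] = new_label
            size[new_label] += 1
            changed = True
        if not changed:
            break  # stable state reached
    return community


# Two triangles joined by a single edge (c-d).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

labels = fluid_communities(graph, k=2)
print(labels)
```

Unlike the Louvain method, the number of communities k has to be chosen in advance, and the result can depend on the random starting nodes, so it is worth running the procedure with several seeds.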

Fluid method for community detection applied to the Helsinki City Bike Network. Image by the author.

We can see that the Fluid community detection method confirms the communities identified with the Louvain method and identifies one additional community.

Eastern Espoo (Magenta Community) is a distinct community shaped by geographic constraints and extending along the metro line towards the west.
Leppävaara, Pitäjänmäki, Munkkiniemi, and Etelä Haaga (Green Community) form another cycling community despite the municipality divisions separating the districts.
The areas of Vallila, Käpylä and Oulunkylä form another group (Pink Community) separated from the Green community by Central Park. This community is stretched along the railroad in northern Helsinki.
The southern part of Helsinki (Blue Community) includes the geographic centre of Helsinki and major stations such as Kamppi and the Central Railway Station, and extends towards the southern shoreline.
Meanwhile, Herttoniemi is at the centre of another enclave (Yellow Community) that extends from Sörnäinen to Vuosaari along the metro line.

Conclusion

In this part of the article, the Helsinki city bike network was briefly analyzed as a complex network. Multiple centrality measures were applied to the network and important centres were identified. Three community detection algorithms revealed different possible groupings of the nodes within the network. The next part of the article will take a look at how machine learning methods can be applied to further enhance understanding of relationships within the network.

About the Author

Jupyter Notebook

The code behind the article can be found here:

GitHub - Geometrein/helsinki-city-bikes: Helsinki City Bike system analysis.

Kaggle

Helsinki City Bikes: Network Analysis was originally published in TDS Archive on Medium.