<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[BlackRock Engineering - Medium]]></title>
        <description><![CDATA[Learn how we’re solving complex engineering problems at BlackRock. - Medium]]></description>
        <link>https://engineering.blackrock.com?source=rss----e8ae0174b0d8---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>BlackRock Engineering - Medium</title>
            <link>https://engineering.blackrock.com?source=rss----e8ae0174b0d8---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 16:36:52 GMT</lastBuildDate>
        <atom:link href="https://engineering.blackrock.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[From Survival to Leadership: My BlackRock TLx Story]]></title>
            <link>https://engineering.blackrock.com/from-survival-to-leadership-my-blackrock-tlx-story-2f921e93a282?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/2f921e93a282</guid>
            <category><![CDATA[career-advice]]></category>
            <category><![CDATA[career-development]]></category>
            <category><![CDATA[growth]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[talent]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Tue, 20 May 2025 17:46:27 GMT</pubDate>
            <atom:updated>2025-04-09T21:39:16.265Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/639/1*gZ5QYoo-nwfLywi8IU8a3g.jpeg" /></figure><h3>From survival to leadership: My BlackRock TLx story</h3><p>BlackRock Product Manager, Jamal Smith, shares his personal career journey and growth as a leader — through the talent program “TLx”</p><blockquote><em>By: </em><strong><em>Jamal Smith</em></strong><em>, Senior Product Manager for Aladdin UX at BlackRock</em></blockquote><p>There’s a moment in every career when you realize you’re standing at a crossroads. For me, that moment came during the height of COVID, when it felt like everyone in tech was questioning their path. Colleagues were departing for lucrative opportunities elsewhere, FOMO was running high, and the “Great Resignation” was in full swing.</p><p>Like many others, I found myself wondering: should I follow the well-worn path to another firm, or was there something deeper worth building where I was?</p><p>But to appreciate that decision, you first need to understand how I got there. You see, my journey to BlackRock was anything but traditional. After law school and at the height of another “GR” — this time, the “Great Recession” — I unexpectedly found myself jobless, homeless, and living out of my car to get by. I had just moved to Miami, excited for a new career and chapter, but was laid off within a few short months and joined the millions of displaced workers facing mounting bills and no prospects. Though I eventually got back on my feet, finding a salaried position after a long stretch of hourly jobs, my confidence in myself seemed irrevocably shaken.</p><p>The path from there to a career in finance wasn’t just unlikely — it felt impossible. When I first applied to BlackRock on the recommendation of a friend, I remember viewing it as some interesting interview practice, nothing more. I didn’t know anything about finance beyond that I was sure I wasn’t worthy to be here. I remember walking into the office for my final interview, wearing a suit I’d bought the night before, feeling like an impostor in the heart of Wall Street. But BlackRock surprised me. Despite my unconventional background, despite my self-doubt, they saw something in me that even I couldn’t see in the mirror. They didn’t just give me a job — they gave me a chance to build my life anew. And they kept investing in me.</p><p>About a year in, I remember being surprised at receiving a particularly significant compensation boost. My manager looked at me and said something I’ll never forget: “We pay for performance here — never underestimate your worth.” She saw my hesitation and took the time to have a deeper conversation about value and self-advocacy. She taught me to understand my contributions, especially as a person from marginalized groups, and to never be afraid to ask for what I wanted. “You’ll only ever get three answers,” she said, “yes, no, or not right now — and then the power is in your hands.</p><p>Fast forward to 2020 and I found myself facing that critical decision. Was it time for me to explore new opportunities, like many friends across my industry were doing? But something held me back — not fear, but instead a growing realization that I needed clarity before I could make a meaningful choice about my own path ahead.</p><p>That’s what led me to TLx.</p><h4>The TLx Difference</h4><p>BlackRock has always invested heavily in developing its people — from rich online academies to mentorship programs to leadership training. 
But I was drawn to our Tech Leadership Accelerator (‘TLx’), an intensive and interactive six-month talent program, for employees in technology roles, designed to provide a comprehensive learning and networking experience. TLx is an application-only program and has been oversubscribed for all 7 cohorts since launching. I sensed TLx might help me answer the deeper questions I was wrestling with: What kind of leader did I want to become? What does tech leadership even mean, particularly in a place like BlackRock? What skills did I have to develop and harness?</p><p>Instead of all the answers, what I discovered through TLx is that there is no single path to tech leadership — and that’s actually the point. Leading in technology is already a unique challenge — you’re navigating an industry that transforms almost daily, trying to bridge the gap between deep technical expertise and broader leadership skills. But at BlackRock, we’re doing this at the intersection of finance and technology, where the stakes couldn’t be higher, and the complexity multiplies with every new market innovation.</p><p>TLx doesn’t just acknowledge these layered challenges — it embraces them, helping us develop the adaptability and breadth of vision required to lead effectively in this complex intersection. The program combines personal coaching, master classes with industry leaders, and small sessions with BlackRock’s top executives. But what makes TLx unique is how it addresses the fundamental transformation that happens when a technical expert begins to step into leadership. It’s not just about adding management skills to your technical toolkit, much like a sales person becoming a CRO or an engineer becoming a CTO. Your technical skills remain vital, but they’re no longer your exclusive identity.</p><h4><strong>Beyond Technical Skills</strong></h4><p>So what other skills did I learn? A key one was about emotional regulation — something I hadn’t expected to focus on in a tech leadership program. I learned that as a leader, your energy sets the tone for your entire team. If you’re having a bad day, everyone has a bad day. If you’re excited about a project, that enthusiasm ripples through the organization. Managing this dynamic isn’t just a soft skill — it’s a crucial leadership competency.</p><p>I also learned that leadership is about crafting and telling compelling stories. Whether you’re seeking resources, driving change, or building consensus, your ability to communicate a vision matters as much as your technical capabilities. At BlackRock, where we’re constantly pushing the boundaries of what’s possible in fintech, this ability to translate technical complexity into compelling narratives is essential.</p><h4>The Power of Perspective</h4><p>But perhaps the most transformative aspect of TLx was how it helped me reframe my own story. During one of our sessions with an external speaker — this time an unlikely CEO who had also experienced homelessness — I had a revelation. She shared how she had reframed her past struggles not as limitations, but as proof of her resilience. “You’ve survived the worst,” she said. “What do you have to fear?… Why not take the risk? Why not raise your hand? Why not try?”</p><p>That perspective shift changed everything. I stopped seeing my past as a liability and started recognizing it as a source of strength. 
It gave me the courage to take risks, to vocalize my interest in new opportunities, to step fully into leadership.</p><h4>A Community of Leaders</h4><p>The power of TLx extends beyond individual transformations. The program brings together people from across BlackRock who might never otherwise connect: veterans who’ve been with the firm for years, newcomers bringing fresh perspectives from other industries, colleagues from different regions of the world. This diversity of experience and viewpoint is intentional, and it creates something special.</p><p>I met colleagues I didn’t even know were in my department until we met in TLx. Today, I regularly schedule one-on-ones and check-ins with people I met through the program who work in entirely different parts of the business. These connections have opened my eyes to how similar our challenges are, despite our different roles. As someone in a unique position — not quite a traditional engineer nor product manager — I had often felt somewhat isolated. TLx showed me that many of the leadership challenges I face are universal, even if our day-to-day work differs.</p><p>These relationships have broken down silos, sparked collaborations, and created a network of support that extends far beyond the six-month program. When you’re trying to navigate a career in tech leadership — especially in finance — having a community of peers who understand your challenges is invaluable.</p><h3>The Choice to Lead</h3><p>Ultimately, the most important lesson I learned through this journey is that your career is your choice — and it’s one you get to make every single day. BlackRock and TLx helped me understand that being the head of a massive organization isn’t the only path to leadership; you can choose to be a senior IC, a line manager, a CTO or anything in between and every one of those is equally valid. What matters is making it a conscious choice, fueling your direction with intention, and understanding the why behind your path — even as it twists and turns.</p><p>Today, I’m grateful I chose to stay and grow at BlackRock. That moment of COVID-induced FOMO led me to something far more valuable than a short-term career move — it led me to clarity about who I am as a leader and where I want to go. Far beyond a sharper set of skills, I’ve found my voice and my purpose.</p><p>Learn more about <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2f921e93a282" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/from-survival-to-leadership-my-blackrock-tlx-story-2f921e93a282">From Survival to Leadership: My BlackRock TLx Story</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI is making engineering more human, not less]]></title>
            <link>https://engineering.blackrock.com/ai-is-making-engineering-more-human-not-less-66f3bfc56771?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/66f3bfc56771</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[copilot-programming]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Mon, 21 Oct 2024 20:10:55 GMT</pubDate>
            <atom:updated>2024-10-21T20:10:55.357Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bwru7m-C1QSzX6J58ppSZw.jpeg" /></figure><blockquote><em>By: </em><strong><em>Lance Braunstein</em></strong><em>, Head of Aladdin Engineering at BlackRock</em></blockquote><p>When people talk about AI and jobs, the conversation often veers towards a dystopian future where robots have taken all our work. Putting aside my de facto disclaimer as a technologist that artificial intelligence has been around for decades (and has been changing jobs all the while), it’s an understandable fear, especially in technical fields like software engineering. After all, if AI can write code, detect and debug errors, automate testing, and facilitate DevOps, what’s left for humans?</p><p>As the Head of Aladdin Engineering at BlackRock, I’m not just theorizing about AI’s impact on jobs — I’m living it. And what I’m witnessing isn’t job displacement, but a profound transformation of the engineering landscape.</p><p>Let’s start with a counterintuitive truth: AI is making engineering more human, not less.</p><p>While AI excels at generating the first draft of code, the essence of software engineering lies in problem-solving, collaboration, applying context, and seeding innovation. These deeply human skills are becoming even more crucial in our current era, as engineers tackle increasingly complex challenges that machines alone can’t solve. While we race to adapt to AI, we’re simultaneously confronting blockchain, cloud computing, and a host of other globally disruptive technologies evolving beneath our feet. Their application and integration require novelty and nuance much more than efficiency.</p><p>Artificial intelligence has been reshaping the engineering landscape for years, creating demand for new roles and skills, but perhaps never at the current pace. The call for machine learning engineers, big data specialists, database and network scientists, and statisticians is surging across the industry. The burden on our platforms for resiliency, scale, security, and performance is multiplying as our universe of prompt engineers expands, creating a vacuum for talented engineers. And the field of AI ethics has emerged as critical, requiring a multidisciplinary approach that blends technical expertise with knowledge of ethics, law, and social science.</p><p>Even conventional engineering roles are transforming. It’s been years since I, individually, pushed a code change but when we first started experimenting with developer copilots, I called server engineers on the floor into my office to see what I’d produced (and tell me how to make it better). Our software developers have become orchestrators of complex ecosystems, harmonizing human-written and AI-generated code on a platform with thirty years of proud history.</p><p>This new paradigm demands not just technical proficiency, but also heightened creativity, sound judgment, and a comprehensive grasp of the problem domain. And every traditional engineering role — from security specialists to application engineers to DevOps — is being reframed with an expectation for understanding how to leverage and adapt to AI pipelines, models, tools, and risk. To say it’s changing the way we hire and how we think about composing teams is an understatement. 
The bar for talent at BlackRock has always been high but more than ever it’s about left brain, right brain synergy and finding innovative collaborators.</p><p>The ripple effects of this AI revolution extend far beyond our engineering teams, touching every corner of BlackRock. Take our recent <a href="https://www.microsoft.com/en-us/industry/blog/financial-services/2024/09/30/elevating-investment-management-tech-ai-powered-leadership-from-blackrock-and-microsoft/?hss_meta=eyJvcmdhbml6YXRpb25faWQiOiAxODkyLCAiZ3JvdXBfaWQiOiAxMTIyOTMyLCAiYXNzZXRfaWQiOiAyMjY1NzAyLCAiZ3JvdXBfY29udGVudF9pZCI6IDEzNjkzOTc2OSwgImdyb3VwX25ldHdvcmtfY29udGVudF9pZCI6IDIwODgxMzY5M30%3D">launch of Aladdin Copilot,</a> for instance, which serves as a connective tissue for our clients to generate insights across our investment platform. It doesn’t replace our client service representatives; instead, the Copilot amplifies their capabilities, enabling them to spend more time understanding and engaging with customer needs and less time executing rote queries.</p><p>This pattern of AI-driven enhancement repeats across our organization. While we’ve leveraged AI and machine learning in our investment process for nearly two decades, recent advancement in large language models represent a step-change in our capabilities. In our investment teams, we’re harnessing fine-tuned LLMs to analyze earnings call transcripts and forecast market reactions with unprecedented accuracy. As we build equity baskets in response to dynamic, market-relevant themes, <a href="https://www.blackrock.com/us/individual/insights/ai-investing">we’re blending proprietary data with the vast knowledge base of LLMs</a> to create a faster and more flexible process than our portfolio managers could execute alone. And in contract management, AI is converting complex legal terms into codified rules for portfolio management, elevating our legal and operations teams to focus on higher-value areas.</p><p>Across these diverse applications, a common thread emerges: AI is both handling the routine and augmenting the experts, enabling our human talent to explore new frontiers, craft smarter insights and tackle ever evolving questions.</p><p>Of course, this transformation comes with challenges. The pace of change is breathtaking, and skills can become outdated almost overnight. That’s why we continue to make significant investments in ongoing learning and upskilling. BlackRock has emphasized the notion of the citizen developer for years and we recently mandated prompt engineering and AI training for all employees — not just our tech teams. We’ve deployed internal tools like <a href="https://news.microsoft.com/source/features/digital-transformation/how-blackrocks-flight-crew-helped-copilot-for-microsoft-365-take-off/">Copilot for Microsoft 365</a> across our main work systems — word processors, email, intranet, and more — and are actively seeing their adoption grow. In a sense, we’re all becoming engineers now, learning to harness AI to enhance our work.</p><p>As AI continues to advance, it uncovers new possibilities, poses fresh questions, and presents novel challenges. Each of these developments calls for human ingenuity. Our ambitions in technology always outpace our current capabilities. We have an insatiable appetite for innovation and progress, constantly pushing the boundaries of what’s possible.</p><p>This ever-expanding frontier of technological potential is precisely why AI isn’t reducing our need for talent — quite the opposite. 
It’s enabling us to tackle bigger challenges, to create more value, and to have a greater impact than ever before. We’re not shedding talent — we’re empowering our people to do more, faster, better, and farther than we ever thought possible.</p><p>Check out BlackRock’s open AI engineering roles <a href="https://blackrock.wd1.myworkdayjobs.com/BlackRock_Professional/job/New-York-NY/Applied-AI-Engineer--Associate---VP_R245572">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=66f3bfc56771" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/ai-is-making-engineering-more-human-not-less-66f3bfc56771">AI is making engineering more human, not less</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Solving the Mystery of the Slow Hash Table]]></title>
            <link>https://engineering.blackrock.com/solving-the-mystery-of-the-slow-hash-table-616627a270c5?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/616627a270c5</guid>
            <category><![CDATA[java]]></category>
            <category><![CDATA[hashmap]]></category>
            <category><![CDATA[apache]]></category>
            <category><![CDATA[hash-table]]></category>
            <category><![CDATA[spark]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Mon, 29 Jul 2024 18:53:12 GMT</pubDate>
            <atom:updated>2024-07-29T18:53:12.735Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e3Xzs2Zwh6krbQY4hUOLtQ.jpeg" /></figure><h4>The story of a Java hash table gone wrong and how it was fixed</h4><blockquote><strong><em>By: Spenson Shih</em></strong><em>, Principal Engineer, Aladdin Wealth Tech &amp; </em><strong><em>Akash Gupta</em></strong><em>, Senior Engineer II, Aladdin Wealth Tech</em></blockquote><p>One of the most used data structures across programming languages is the hash table. In Java, the <a href="https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html">HashMap</a> class is a hash table implementation that provides a powerful means to store and retrieve data efficiently. There are also many variations of the hash table implementation both built into the JDK or as a 3rd party extension.</p><p>All these various hash tables are built upon the same basic concept. Understanding its inner workings is crucial for any developer seeking to optimize data handling. In this blog, we’ll share with you a case-study of a hash table gone wrong, and how we identified and fixed the problem.</p><h3>Very Large Portfolios</h3><p>In the Aladdin Wealth business at BlackRock, we have many kinds of clients and are always evolving and enhancing the platform to support different use cases, especially where large scale is needed. Scale usually comes in the form of many portfolios, sometimes in the millions, but in one case, we wanted to use our platform to analyze a client’s entire book of business as if it were a single portfolio. The largest one of these pseudo-portfolios was composed of nearly 50 million positions from around 2 million portfolios, and over 250,000 unique securities.</p><p>We use Apache Spark to distribute workload across nodes for portfolio construction and portfolio analytics computation. This Spark job was expected to take around 4 hours to complete. However, we found that the runtime was actually taking over 16 hours. This was a significant increase and would impact our ability to deliver results for the client in a timely manner.</p><p>Troubleshooting this issue in a massive Spark application is like searching for a needle in a haystack. Our team performed all of the usual troubleshooting steps — we analyzed the GC logs, resource utilization in Grafana, enabled debug logs, and so on.</p><p>It took a while, but using the data collected from the above, we extracted the execution times of each analytics calculator for various portfolio sizes. The data table below summarizes the execution times in milliseconds of each calculator for various input sizes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/597/1*mNu-WGy02H6ZroUyENVjRQ.png" /></figure><p>As you can see, the slowness seems to come from the calculation of the Breakdowns. In fact, if you plot the portfolio size vs the time in millis, you will notice that the Breakdowns Calculator seems to have a quadratic time complexity or O(N2) — as the input size increases, the runtime also increases in a quadratic fashion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/423/1*wfBu5e_eizatyEKrUxxc0Q.png" /></figure><h3>JMH Benchmark</h3><p>We used the Java Microbenchmarking Harness to attempt to reproduce the slowness when invoking Breakdowns Calculator with the same input size of 250,000 positions. Unfortunately, our attempts to replicate the slow execution times using a benchmark were unsuccessful. 
The benchmark result shows that the calculator completes in approximately 20 seconds for the same input size, compared to over 4 hours in the Spark job.</p><h3>Position Contributors</h3><p>Part of the Breakdowns Calculator involves calculating the contribution of each position. We decided to disable the calculation of position contributions to further isolate where the bottleneck is coming from. To also speed up our test runs, we limited the runs to only the portfolio of 250,000 positions, as a single re-run took 16 hours to finish. By disabling the position contributors, the Breakdowns Calculator for the portfolio of 250k position completed in under 20 seconds. We had isolated the problem to somewhere in the position contributor’s code.</p><p>Let’s focus on the part that calculates the breakdown position contributors. We use a library called <a href="https://fastutil.di.unimi.it/">fastutil</a>, an extension to the Java Collections Framework, which provides Collections implementations that avoid the <a href="https://docs.oracle.com/javase/tutorial/java/data/autoboxing.html">autoboxing and unboxing</a> of primitive to object wrappers in Java that can hurt performance as seen <a href="https://dzone.com/articles/java-performance-notes-autoboxing-unboxing">here</a>. The primitive double map is used to accumulate the sum of the position contributions keyed with the unique position identifier and the security identifier:</p><pre>if (includeContributors) {<br>contributors.computeDouble(Pair.of(position.getId(), position.getCusip()), (k, v) -&gt; defaultIfNull(v, 0.0) + posContribution);<br>}</pre><p>It’s a mystery why this code runs significantly slower compared to our tests using the exact same code and similar sized inputs. The logic is straight-forward, as it simply calculates the sum of the contribution grouped on the position ID. Our benchmarking of fastutil map also shows we can easily build similar sized maps within seconds. Fastutil maps are also used in Risk Calculator — which does not seem to have the same issue as you can see from the timings earlier.</p><p>The advantage of a hash table is that the time complexity to insert and retrieve a value is a constant time O(1) on average. So how can the use of a hash table result in an O(N2) quadratic time complexity when used in our code?</p><h3>Hash Table Basics</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/756/1*w5xJO_z0CosD2dOXNHEMPg.png" /><figcaption>Illustration of a classical hash map using separate chaining. Keys (left) are put into corresponding buckets (middle) obtained by the hash function h. A linked list (right) is constructed per bucket to store key value pairs within the same bucket but with unequal keys. 
<a href="https://www.researchgate.net/figure/Illustration-of-a-classical-hash-map-using-separate-chaining-Keys-left-are-put-into_fig1_355070749">source</a></figcaption></figure><p>To answer that question, we need to have a basic understanding of how hash tables work:</p><p>· the underlying basic structure is an array of linked lists, representing a bucket</p><p>· when performing a lookup or insertion, calculate the hash of the key</p><p>· <em>mod</em> the hash to give an integer between 0 and the size of the array of buckets — this is the index of the selected bucket</p><p>· iterate through the linked list in the selected bucket, comparing equality of keys</p><p>· if keys match, return the associated value with that key</p><p>All the above operations: calculating the hash, mod, array indexing, and adding element to the end of a linked list, have constant time complexity or O(1) — which is to say as the hash table grows, the time taken to fetch or store a value does not change. This is assuming you have a good hash function, and your elements are evenly distributed across the buckets of your hash table.</p><h3>Hash Collision</h3><p>A hash table lookup operation can be severely degraded if you have a weak hash function. This will result in a lot of hash collisions — when two or more pieces of data in a hash table share the same hash value. In a basic hash table implementation, a bucket usually consists of a linked list. In the event of a collision, the entry will either replace an element in the list or add it to the list. In the worst-case scenario, all entries end up in the same bucket and you end up with a time complexity of O(N) for a lookup or put operation. In Java 8 and above, the HashMap implementation will replace the linked list with a balanced binary tree if a certain threshold is exceeded. This provides a better performance in the case of collision, as lookup and put will become O(log N) instead of O(N). However, better than solving hash collisions is avoiding them from happening in the first place.</p><p>With this in mind, we now focused our attention on the key of the map, where we use Apache Commons Lang3 <a href="https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/tuple/Pair.html">Pair</a> tuple. This is an ImmutablePair class which is a pair of two values — the position identifier and the security identifier. Below is the hash code implementation of the ImmutablePair class:</p><pre>public int hashCode() {<br>return Objects.hashCode(this.getKey()) ^ Objects.hashCode(this.getValue());<br>}</pre><p>Both the position identifier and security identifier are instances of Java String and will have a good hash function. <em>The hash code of our key is effectively the hash code of the position identifier combined with the hash code of the security identifier using a bitwise exclusive or (XOR) operator:</em></p><pre>hashCode(position.getId()) XOR hashCode(position.getCusip()</pre><p><em>XOR</em> is a bitwise operator where if input bits are the same, then the output is 0, else 1.</p><p>During the ETL phase of the Spark job, we collapsed positions with the same securities into a single item when building the portfolios. Let’s use the below example 3 accounts:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kgGRcVG0N0sbbrFEB7FmmA.png" /></figure><p>The pseudo-portfolio will be a combination of all 3 accounts, with 9 positions. 
<h3>Hash Collision</h3><p>A hash table lookup operation can be severely degraded if you have a weak hash function. This will result in a lot of hash collisions — when two or more pieces of data in a hash table share the same hash value. In a basic hash table implementation, a bucket usually consists of a linked list. In the event of a collision, the entry will either replace an element in the list or be added to the list. In the worst-case scenario, all entries end up in the same bucket and you end up with a time complexity of O(N) for a lookup or put operation. In Java 8 and above, the HashMap implementation will replace the linked list with a balanced binary tree if a certain threshold is exceeded. This provides better performance in the case of collisions, as lookup and put become O(log N) instead of O(N). However, better than handling hash collisions is avoiding them in the first place.</p><p>With this in mind, we now focused our attention on the key of the map, where we use the Apache Commons Lang3 <a href="https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/tuple/Pair.html">Pair</a> tuple. This is an ImmutablePair, a class which holds a pair of two values — the position identifier and the security identifier. Below is the hash code implementation of the ImmutablePair class:</p><pre>public int hashCode() {<br>return Objects.hashCode(this.getKey()) ^ Objects.hashCode(this.getValue());<br>}</pre><p>Both the position identifier and security identifier are instances of Java String and will have a good hash function. <em>The hash code of our key is effectively the hash code of the position identifier combined with the hash code of the security identifier using a bitwise exclusive or (XOR) operator:</em></p><pre>hashCode(position.getId()) XOR hashCode(position.getCusip())</pre><p><em>XOR</em> is a bitwise operator where if the input bits are the same, the output is 0, else 1.</p><p>During the ETL phase of the Spark job, we collapsed positions with the same securities into a single item when building the portfolios. Let’s use the below example of 3 accounts:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kgGRcVG0N0sbbrFEB7FmmA.png" /></figure><p>The pseudo-portfolio will be a combination of all 3 accounts, with 9 positions. But by collapsing the same positions into a combined entry, we reduce the number of positions to just the unique tickers, and we end up with a portfolio of 6 entries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/409/1*ADsgQHk6_4kzO21rhP_ZjQ.png" /></figure><p>In real-world data, this process reduced the number of entries in the largest portfolio from 50 million entries to just 250,000. To identify the contributors of each security, the position identifier was assigned to be the same as the security identifier.</p><p>The eureka moment came as soon as we combined this information with the hash code implementation of the Apache Commons Lang3 Pair class. Everything made perfect sense. With the position identifier and security identifier having the same value for all 250,000 entries, every hash code is calculated as 0, and we end up with a hash collision for every single entry.</p><pre>hashCode(&quot;IVV&quot;) XOR hashCode(&quot;IVV&quot;) = 72905 XOR 72905 = 0<br>hashCode(&quot;AAPL&quot;) XOR hashCode(&quot;AAPL&quot;) = 2001436 XOR 2001436 = 0<br>hashCode(&quot;AGG&quot;) XOR hashCode(&quot;AGG&quot;) = 64737 XOR 64737 = 0<br>…</pre><p>The hash collision means every single entry in the map ends up in the same bucket. The computeDouble call becomes an O(N) operation executed inside the breakdowns calculator’s O(N) loop over the positions — effectively giving us the O(N²) time complexity that we see in our production runs.</p>
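<p>To make the failure mode concrete, here is a small self-contained sketch (ours, for illustration, not the production code) that reproduces the degenerate hashing with the same fastutil map type and Apache Commons Pair keys; the identifiers, sizes, and timings are illustrative only:</p><pre>// Illustrative sketch, not the production code. Requires commons-lang3 and fastutil on the classpath.<br>import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap;<br>import org.apache.commons.lang3.tuple.Pair;<br><br>public class PairCollisionDemo {<br><br>    public static void main(String[] args) {<br>        // Any Pair whose two elements are equal hashes to 0, because h ^ h == 0.<br>        System.out.println(Pair.of(&quot;IVV&quot;, &quot;IVV&quot;).hashCode());   // prints 0<br>        System.out.println(Pair.of(&quot;AAPL&quot;, &quot;AAPL&quot;).hashCode()); // prints 0<br><br>        int n = 50_000; // even a fraction of the real 250,000 keys shows the blow-up<br>        System.out.println(&quot;distinct ids: &quot; + accumulate(n, false) + &quot; ms&quot;);<br>        System.out.println(&quot;id == cusip : &quot; + accumulate(n, true) + &quot; ms&quot;);<br>    }<br><br>    // Accumulate one contribution per position, keyed by (positionId, cusip),<br>    // mirroring the map usage described above.<br>    static long accumulate(int n, boolean idEqualsCusip) {<br>        Object2DoubleOpenHashMap&lt;Pair&lt;String, String&gt;&gt; contributions = new Object2DoubleOpenHashMap&lt;&gt;();<br>        long start = System.nanoTime();<br>        for (int i = 0; i &lt; n; i++) {<br>            String cusip = &quot;SEC&quot; + i;<br>            String positionId = idEqualsCusip ? cusip : &quot;POS&quot; + i;<br>            // every key hashes to 0 when positionId equals cusip, so all of them collide<br>            contributions.addTo(Pair.of(positionId, cusip), 1.0);<br>        }<br>        return (System.nanoTime() - start) / 1_000_000;<br>    }<br>}</pre><p>With distinct position identifiers the accumulation should finish in milliseconds, while the id-equals-cusip variant funnels every key into one probe chain and its runtime grows quadratically with the number of positions.</p>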
<p>This explains why our tests and benchmarks were unable to reproduce the slowness and why we never suspected a hash collision, as the position identifier and security identifier are never the same values there. The same code running in our interactive servers also did not encounter the same slowness for the same reason. It’s only during the portfolio analysis that the position identifier and security identifier have the same value, which is why we ended up with the hash collisions.</p><h3>The Fix</h3><p>Knowing the reason behind the slow execution, the fix was simple and straightforward. We updated the map to use just the position identifier as the key, and to store the security identifier together with the accumulated position contributions as part of the value instead, by creating a new MutablePositionBreakdownContributor class.</p><pre>if (includeContributors) {<br>contributors.computeIfAbsent(position.getId(), posId -&gt; new MutablePositionBreakdownContributor(posId, position.getCusip())).addContribution(posContribution);<br>}</pre><p>This change eliminates the hash collision for the portfolios, which means the average time complexity of the map operation is back to constant time O(1), and our breakdowns calculator now has a time complexity of O(N) for both the interactive and batch processes.</p><p>The data extracted from the logs of the latest run after the patch confirms that the breakdowns for the very large portfolio now complete in just under 17 seconds, even with position contributors enabled. Overall, this change brought the runtime of the Spark job from 16 hours back to the expected 4 hours.</p><p>Looking back at the commit history of how this came to be, it turns out the change to use the HashMap was part of a previous optimization to make the breakdowns calculator faster. That change involved the use of a cached Spring Expression evaluation context, but a small refactor of the contribution logic was made at the same time to reduce intermediate results. At that time, a benchmark was also performed, but as we now know, the benchmark failed to simulate the conditions in the batch Spark jobs that led to the hash collisions.</p><p>Of course, there are many other factors that can affect the performance of a hash table implementation. These are very well-studied problems, and you can find several resources online about this topic.</p><h3>Lessons learned</h3><p>In our view, the hash function of the Apache Commons Lang3 Pair class is weak, since having the same value for both elements always results in a hash code of 0. The implementation does not violate the hash code contract, but when such pairs are used as keys in hash tables, performance will be severely degraded if both elements happen to have the same value. We considered reporting this as an issue to the Apache Commons team. However, a quick search shows <a href="https://issues.apache.org/jira/browse/LANG-1464">this</a> has already been identified back in 2019 and is still open. The explanation given is that this is a weakness in the hash code contract of Java’s Map.Entry interface and needs to be addressed there. In short, do not use Apache’s Pair class, or any class that implements the Map.Entry interface, as a key in any hash table or hash-based data structure — like a HashSet in Java. As with any hash-based data structure, make sure you fully understand what’s going on inside your hash functions. As you can see, this could mean a difference of several hours of compute time if you are dealing with very large amounts of data.</p><p>In conclusion, this case study highlights the importance of understanding the inner workings of data structures, even those provided by the programming language or reliable third parties. By identifying the root cause of the hash collision and implementing a simple fix, the team was able to bring the runtime of the analysis job back to its original duration. This experience also led the Distributed Compute team to develop a mechanism to monitor Spark jobs and gather detailed telemetry, allowing for easier identification and isolation of bottlenecks in production runs.</p><p>Learn more about <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=616627a270c5" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/solving-the-mystery-of-the-slow-hash-table-616627a270c5">Solving the Mystery of the Slow Hash Table</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Ambiguity in Semantic Search: Introducing Blowfish]]></title>
            <link>https://engineering.blackrock.com/unlocking-ambiguity-in-semantic-search-introducing-blowfish-e2a83c5c05a2?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/e2a83c5c05a2</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[statistical-analysis]]></category>
            <category><![CDATA[data-scientist]]></category>
            <category><![CDATA[semantic-search]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Fri, 28 Jun 2024 20:26:13 GMT</pubDate>
            <atom:updated>2024-10-29T21:57:07.643Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/614/1*cxADZ3XwuqWKnzIF1ePX4Q.jpeg" /></figure><h4>Aladdin Data Scientists Thomas and Alex recap a recent white paper where they introduce practical insights into topological and statistical signatures for quantifying ambiguity in semantic search</h4><blockquote><em>By: </em><strong><em>Thomas Roland Barillot</em></strong><em>, </em>Lead Data Scientist, Aladdin Engineering,<em> &amp; </em><strong><em>Alex De Castro, PhD,</em></strong><em> </em>Senior Researcher and Data Scientist II, Aladdin Engineering</blockquote><p>In the digital age, where data is abundant and readily available, semantic search has become an essential tool for extracting relevant information. However, the challenge of ambiguity in search results persists, complicating the retrieval process. In our recently published paper, “<a href="https://arxiv.org/pdf/2406.07990"><strong>Blowfish: Topological and Statistical Signatures for Quantifying Ambiguity in Semantic Search</strong></a><strong>”</strong>, we offer novel and practical insights into this issue.</p><h4><strong>Understanding Ambiguity in Semantic Search</strong></h4><p>Semantic search aims to enhance search accuracy by understanding the contextual meaning of terms within the searchable data space. For example, a traditional keyword search for “apple” might return documents about the fruit and the technology company indiscriminately. In contrast, semantic search can differentiate between these contexts based on surrounding content, returning more relevant results. However, ambiguity arises when a search query or the context of documents is unclear or open to multiple interpretations. For instance, a search for “Java” could refer to the programming language, an Indonesian island, or a type of coffee. The paper introduces a method to quantify and analyze this ambiguity using both topological and statistical approaches, providing a clearer understanding of its impact on search accuracy.</p><h4><strong>Blowfish Framework</strong></h4><p>The Blowfish framework leverages advanced techniques from topology and statistics to identify and measure ambiguity in semantic search. The core innovation lies in the analysis of sentence embeddings — mathematical representations of sentences that capture their semantic meaning. By examining the homological features, which are essentially the shapes and structures formed by these embeddings in a multi-dimensional space, the authors were able to detect patterns indicative of ambiguity. For example, clusters of sentence embeddings that are closely packed together may indicate a lack of clarity, as multiple meanings are represented similarly. This method provides a novel way to understand and address the challenges posed by ambiguous search queries.</p><h3><strong>Key Findings</strong></h3><h4><strong>1. Topological Signatures</strong></h4><p>Ambiguous queries tend to produce more complex topological structures in the manifold of sentence embeddings. To clarify, “topological structures” refer to the shapes and connections formed by data points in a high-dimensional space, and a “manifold” is this high-dimensional space itself where the data points (sentence embeddings) reside. The complexity of these structures can be visualized and quantified, providing a clear distinction from less ambiguous queries. By analyzing these patterns, we can better understand and address the challenges posed by ambiguous search queries.</p><h4><strong>2. 
Statistical Analysis</strong></h4><p>Statistical methods enhance our topological findings by measuring how spread out or clustered the sentence embeddings are. Ambiguous queries often result in a more scattered distribution, showing multiple possible meanings or interpretations. Imagine the semantic space of vector representations as a landscape; near ambiguous queries, this landscape tends to resemble a blowfish, with points spreading out in various directions.</p><h4>3. <strong>Experimental Validation</strong></h4><p>The paper presents extensive experimental results, showing that ambiguous queries, when analyzed through Blowfish, exhibit distinct patterns that can be effectively identified and quantified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/493/1*AnFQx-LmUewtpsekJwCgMw.jpeg" /></figure><h3><strong>Practical Applications</strong></h3><p>The implications of this research are far-reaching. By incorporating Blowfish into existing semantic search systems, it becomes possible to enhance their accuracy and reliability. For instance, in AI-driven applications such as Retrieval Augmented Generation (RAG) systems, better handling of ambiguity can lead to more precise information retrieval and improved user experiences.</p><h3><strong>Conclusion</strong></h3><p>Blowfish represents a significant advancement in the field of semantic search, offering a novel approach to tackle the enduring problem of ambiguity. By blending topological and statistical analyses, this framework not only identifies but also quantifies ambiguity, paving the way for more effective and accurate search technologies.</p><p>Our findings were driven by commercial applications at BlackRock. While building LLM-powered search systems, we noticed a need to understand why certain queries produced useful results and others did not. Factors characterizing good queries, such as clarity and context, became apparent through experimentation. We realized that ambiguity or insufficient context was a significant driver behind poor results. This led us to develop quantitative methods to detect ambiguity, enabling us to explain and audit our vector search systems effectively.</p><p>For a deeper dive into the methodologies and findings, the full paper is available <a href="https://arxiv.org/pdf/2406.07990">here</a>. This research is poised to influence future developments in semantic search, making it a crucial read for those in the field of AI and data retrieval.</p><p>Blowfish is a Python library that quantifies ambiguity in semantic search using a Gaussian Kernel Density Estimation (KDE) model. KDE estimates the probability distribution of data points — in this case, factors based on sentence embeddings — across a continuous space. Blowfish automates embedding and clustering of document chunks, trains on query-answer pairs, and evaluates ambiguity using topological features. Its strength lies in managing complex semantic search cases, contributing to ranking or explanation tasks in Retrieval Augmented Generation (RAG) systems. 
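</p><p>As a toy illustration of that idea (this is not the Blowfish API; the embedding sizes, projection, and scoring below are simplifications of our own), the dispersion of a query’s neighborhood can be scored with a Gaussian KDE from SciPy:</p><pre># Toy sketch of the KDE intuition only, not the Blowfish library API.<br># Idea: fit a Gaussian KDE over the nearest-neighbor embeddings of a query and<br># treat low average density (a scattered neighborhood) as a sign of ambiguity.<br>import numpy as np<br>from scipy.stats import gaussian_kde<br><br>def ambiguity_score(query_vec, doc_vecs, k=10, dims=3):<br>    # cosine similarity of the query against every document embedding<br>    sims = doc_vecs @ query_vec / (<br>        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)<br>    neighbors = doc_vecs[np.argsort(-sims)[:k]]<br>    # naive low-dimensional projection so the KDE has enough samples per dimension<br>    projected = neighbors[:, :dims].T<br>    kde = gaussian_kde(projected)<br>    densities = kde(projected)<br>    # scattered neighborhood -&gt; low densities -&gt; higher score<br>    return float(-np.log(densities).mean())<br><br># usage with random stand-in embeddings (swap in real sentence embeddings)<br>rng = np.random.default_rng(0)<br>documents = rng.normal(size=(1000, 384))<br>query = rng.normal(size=384)<br>print(ambiguity_score(query, documents))</pre><p>In this sketch a scattered neighborhood yields lower densities and therefore a higher score, mirroring the dispersion behavior described above.</p>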
<p>With flexible model configurations, the library provides a robust framework for handling ambiguity and clarity in responses.</p><p>Learn more about <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e2a83c5c05a2" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/unlocking-ambiguity-in-semantic-search-introducing-blowfish-e2a83c5c05a2">Unlocking Ambiguity in Semantic Search: Introducing Blowfish</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Open Sourcing the Aladdin SDK: Empowering Python Developers with a Quantitative Edge]]></title>
            <link>https://engineering.blackrock.com/open-sourcing-the-aladdinsdk-empower-python-developers-with-a-quantitative-edge-7f63376061e6?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/7f63376061e6</guid>
            <category><![CDATA[quant]]></category>
            <category><![CDATA[sdk]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Tue, 07 May 2024 15:54:21 GMT</pubDate>
            <atom:updated>2024-09-20T12:49:29.536Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*0fP53V1LnBzJc9g04OY8BQ.jpeg" /></figure><blockquote><em>By: </em><strong><em>Vedant Naik</em></strong><em>, Lead Engineer, Aladdin Studio, and </em><strong><em>Eli Kalish</em></strong><em>, Product Manager II, Aladdin Studio</em></blockquote><p>In the rapidly changing landscape of software development, the demand for efficient mechanisms to harness data and execute complex strategies has never been greater. Developers often have diverse sets of skills and varying levels of proficiency, introducing a need for simplicity and standardization around how Aladdin, BlackRock’s investment management software, offerings are consumed. We are pleased to announce <a href="https://github.com/blackrock/aladdinsdk">the Aladdin SDK</a>, a bespoke Software Development Kit written in Python that provides client-side code for API and Aladdin Data Cloud (ADC) usage, and several common utilities under one convenient library.</p><h3>Introducing the Aladdin SDK</h3><p>The Aladdin SDK provides pre-built tools and functionality that simplifies connectivity to Aladdin APIs and the Aladdin Data Cloud, so you can accelerate development on top of Aladdin technology. Whether you’re a professional engineer or a code-proficient business user, the Aladdin SDK provides you with the tools and customization to make Aladdin your own.</p><p>Boost code efficiency by minimizing lower-level setup required for API and Aladdin Data Cloud access, and seamlessly handle authentication, request retries, and error handling. With built-in utilities under one convenient library, unlock a wide range of Aladdin capabilities that can be configured specifically for your needs. Streamline your setup and simplify your workflow through common connections, creating a consistent environment that allows you to build safely and securely. And lastly, leverage standardization to simplify code maintenance and accelerate development, so you can focus on delivering impactful business solutions, faster.</p><p>The Aladdin SDK aims to empower more practitioners and citizen developers — professionals with financial expertise and some programming proficiency — to boost their productivity, streamline their development, and build more efficiently at scale.</p><h3>Simplified Access to Aladdin APIs and the Aladdin Data Cloud</h3><p>Using the Aladdin SDK drastically reduces the amount of technical setup necessary to access Aladdin APIs and Data. Aladdin APIs enable clients to build solutions with access to the breadth and depth of Aladdin services. Aladdin SDK plugins enables service providers and consumers to dynamically update API capabilities. The Aladdin Data Cloud is a managed data store, powered by Snowflake technology, that brings together Aladdin and non-Aladdin data as a time-series on a single analytics platform. Both of these offerings are heavily used within BlackRock by quantitative researchers, software developers and citizen developers.</p><p>The Aladdin SDK is a unifying force, providing consistent patterns for developers to interact with curated Aladdin Graph APIs and ADC datasets. Users can access APIs with built-in functionality for OAuth and Basic Authentication, and query Snowflake data with a single line of code.</p><h3>Configuration Driven Implementation</h3><p>The Aladdin SDK allows users to define and modify the SDK’s behavior through easily understandable configuration files. 
This powerful feature further simplifies the implementation process, while enhancing the SDK’s adaptability to diverse scenarios. Developers can tweak parameters, select preferred authentication mechanisms, and fine-tune strategies through straightforward configurations, making the Aladdin SDK a truly flexible and customizable tool.</p><p>Under the hood, the Aladdin SDK uses <a href="https://www.dynaconf.com/">dynaconf</a> for configuration management. Users can provide a configuration yaml file by pointing to a filesystem location using the ASDK_USER_CONFIG_FILE environment variable. Users can override any of the configurable settings by exporting an environment variable prefixed by `ASDK_`. For a full list of available configurations refer to the official Aladdin SDK <a href="https://github.com/blackrock/aladdinsdk/blob/main/README.md">README.md</a></p><h4>Run Environment Customizations via Configurations</h4><p>The configuration module allows SDK owners, or Development Environment orchestrators, to provide a customized Aladdin SDK experience for their end users. This can be done by setting hard configurations using environment variable overrides, or providing a Default Configuration file at startup.</p><p>Additional configurations can always be provided by using ASDK_USER_CONFIG_FILE to point to a user created file. This will override any default configurations.</p><p>Configurable settings set via environment variables take priority and override all other configurations.</p><h3>Common Development Utilities</h3><p>Having a standard approach for common tasks ensures everyone is building using the same patterns. To achieve this, the Aladdin SDK abstracts away actions such as authentication, retry, error handling, data transformations, file exports, and more. The user can provide configurations that dictate the behavior of each action, rather than having to worry about implementing these from scratch for different projects. This allows all users to simply instantiate an API or ADC client object and proceed with their development tasks.</p><h3>The Aladdin SDK in Action</h3><h4>Making an API call</h4><p>Let’s take an example of making an API call to see how all the above points come together in the real world. We will do this while comparing and contrasting the Aladdin SDK with the current API usage experience.</p><p>For this example, we want to invoke an endpoint on Order API. 
Specifically, the ‘orders:filter’ endpoint:</p><pre># import libraries for API Calls / Data Analysis<br>import json,uuid,datetime,requests,pprint,math,csv<br>import pandas as pd<br> <br># set Client Environment (from url), Aladdin Username &amp; Password, and APIKey<br>client_env = &#39;&#39;<br>username = &#39;&#39;<br>pwd = &#39;&#39;<br>APIkey= &#39;&#39;<br> <br>today = datetime.date.today().strftime(&#39;%Y-%m-%d&#39;)<br>         <br>#function to generate unique headers when called<br>def generate_headers():<br>    headers = {<br>        &#39;VND.com.blackrock.API-Key&#39;: APIkey,<br>        &#39;VND.com.blackrock.Origin-Timestamp&#39;: <br>            str(datetime.datetime.utcnow()<br>                .replace(microsecond=0).astimezone().isoformat()),<br>        &#39;VND.com.blackrock.Request-ID&#39;: str(uuid.uuid1())<br>    }<br>    return headers<br>     <br>#populate filter parameters<br> <br>params = {<br>  &quot;query&quot;: {<br>    &quot;portfolioGroupCriterion&quot;: {<br>      &quot;portfolioGroupTicker&quot;: &quot;TST-TRD1&quot;<br>    }<br>  }<br>}<br> <br>url = f&#39;https://&#39;+client_env+&#39;.blackrock.com/api/trading/order-management/order/v1/orders:filter&#39;<br>r = requests.post(url,auth=(username,pwd),json=params,headers=generate_headers()).json()<br>print(r)</pre><p>A quick analysis of the above code snippet helps us identify some key pain-points:</p><ul><li>Setup/configuration and boilerplate code is verbose</li><li>Users are required to construct and provide header fields such as Authorization, Origin-Timestamp and Request-ID</li><li>Users are required to know the complete hostpath</li><li>Any changes to the API path/configurations in the future would require all users to also update their code</li><li>Users would need to validate response data types</li></ul><p>Now, let’s look at the same request using the Aladdin SDK:</p><pre>from aladdinsdk.api import AladdinAPI<br> <br>api_instance_order = AladdinAPI(&quot;OrderAPI&quot;)<br>request_body = {<br>    &quot;query&quot;: {<br>        &quot;portfolioGroupCriterion&quot;: {<br>            &quot;portfolioGroupTicker&quot;: &quot;TST-TRD1&quot;,<br>        }<br>    }<br>}<br>orders_res = api_instance_order.post(&quot;/orders:filter&quot;, request_body)</pre><p>Configurations can be provided in the following ways:</p><ol><li>(Preferred) Using a configuration file. These configurations apply to all API calls in the script:</li></ol><pre>export ASDK_USER_CONFIG_FILE=&lt;file path pointing to below file&gt;</pre><pre>RUN_MODE: local<br>API:<br>  AUTH_TYPE: Basic Auth<br>  TOKEN: &lt;api key from studio UI&gt;<br> <br>USER_CREDENTIALS:<br>  USERNAME: &lt;uname&gt;<br>  ENCRYPTED_PASSWORD_FILEPATH: &lt;path to file containing encrypted secret&gt;  <br>    # Password is optional while working locally since SDK can store and <br>    # retrieve this secret from OS&#39;s Credential Manager</pre><p>2. Environment variables can be used to override configurations given via the config file. These configurations also apply to all API calls in the script:</p><pre>export ASDK_RUN_MODE=local<br>export ASDK_API__TOKEN=&lt;api key from studio UI&gt;<br>export ASDK_API__AUTH_TYPE=&quot;Basic Auth&quot;<br>export ASDK_USER_CREDENTIALS__USERNAME=&lt;username&gt;<br>export ASDK_USER_CREDENTIALS__ENCRYPTED_PASSWORD_FILEPATH=&lt;secret filepath&gt;</pre><p>3. Configurations can also be simply provided in line, for when users want to try a different configuration, but not disturb the setup for the whole script. 
For example:</p><pre>from aladdinsdk.api import AladdinAPI<br><br>api_instance_order = AladdinAPI(&quot;OrderAPI&quot;, api_key=&quot;&lt;from studio UI&gt;&quot;)</pre><p>Some key advantages of this approach are:</p><ul><li>Users can set up API configuration once and have that available for all API calls in their code</li><li>Authentication, retry and error handling can be performed by the AladdinAPI class — configurations help tweak the behavior of each functionality</li><li>Response validation is done by OpenAPI generated code</li></ul><h3>What about other authentication mechanisms?</h3><p>In the above example, we used ‘Basic Auth’ and kicked off a ‘Long Running Operation’. What if we want to use a different authentication mechanism? Or follow up the action by polling the LRO until the result is computed?</p><p><em>To switch from Basic Auth to OAuth:</em></p><p>Without the Aladdin SDK: Users would now have to set up more boilerplate code to use OAuth credentials and communicate with the Auth Server to get an “OAuth Access Token”. This would also require distributing more educational material to some users and supporting them through the Basic Auth to OAuth migration process.</p><p>With the Aladdin SDK: As some eager readers might have already pointed out, Authentication Type is a configurable attribute on the SDK. Users would simply need to change their configurations to look something like this:</p><pre>RUN_MODE: local<br> <br>API:<br>  AUTH_TYPE: OAuth<br>  OAUTH:<br>    CLIENT_ID: &lt;oauth client ID&gt;<br>    CLIENT_SECRET: &lt;oauth client secret&gt;<br>    REFRESH_TOKEN: &lt;oauth user refresh token&gt;</pre><p>Note: Additional configuration attributes let users point to files containing the secrets/tokens so sensitive information is not part of configuration files.</p><p><em>Long Running Operation (LRO) utility:</em></p><p>Without the Aladdin SDK, users must perform the following additional steps to retrieve the results of a long running operation:</p><ul><li>Parse the first API response to get the LRO-ID</li><li>Identify the LRO status check endpoint</li><li>Set up a polling loop</li><li>At responsible intervals, invoke the status check endpoint until the LRO is complete</li></ul><p>With the Aladdin SDK: Simply invoke the awaitable call_lro_api method on the API wrapper, and provide the LRO status check endpoint path as a parameter. 
The SDK performs the steps to wire the LRO ID to the status check endpoint, and periodically ping the status check endpoint to see if the LRO has completed.</p><pre>import json<br>import asyncio<br>from aladdinsdk.api import AladdinAPI<br>from aladdinsdk.common.datatransformation import json_to_pandas<br> <br>def process_lro_response(lro_response):<br>    &quot;&quot;&quot;<br>    Callback function to process the LRO response<br>    &quot;&quot;&quot;<br>    print(&quot;===========================DONE==============================&quot;)<br>    json_object = json.dumps(lro_response[&#39;response&#39;], indent=8)<br>    df = json_to_pandas.convert(json_object, &quot;results.[*].output.order&quot;)<br>    print(df)<br> <br>api_instance_order = AladdinAPI(&quot;OrderAPI&quot;)<br>orders_payload = {<br>    &#39;postOrderConfig&#39;: { &#39;whatIf&#39;: False, &#39;runCompliance&#39;: True },<br>    &#39;orders&#39;: [{<br>        &quot;transactionType&quot;: &quot;BUY&quot;,<br>        &quot;assetId&quot;: &#39;TYU520157&#39;,<br>        &quot;orderDetails&quot;: [{<br>            &quot;portfolioReference&quot;: { &quot;portfolioTicker&quot;: &quot;TST-TRD1&quot; },<br>            &quot;quantity&quot;: 1000.0<br>        }]<br>    }]<br>}<br>lro = api_instance_order.call_lro_api(<br>   start_lro_endpoint=((&quot;/orders:batchPost&quot;, &quot;post&quot;)),  # start endpoint<br>   request_body=orders_payload,  # start_lro_endpoint request body<br>   check_lro_status_endpoint=((&quot;/longrunningoperations/{id}&quot;, &quot;get&quot;)),<br>       # endpoint to check status<br>   _deserialize_to_object=False,<br>   callback_func=process_lro_response)<br>   # Note: if no callback function provided, response is returned as is<br> <br>asyncio.run(lro)</pre><p>Note: This functionality is more flexible with configurations/utilities such as setting polling frequency, timeouts, retries and callback functions.</p><h3>Connecting to the Aladdin Data Cloud (ADC)</h3><p>Continuing with the compare and contrast approach, let’s take an example of connecting to the Aladdin Data Cloud — Aladdin’s cloud-hosted data and analytics platform built on Snowflake technology — to make a simple query and read the data into a pandas dataframe.</p><p>Below is the conventional method to create an ADC connection:</p><p>First, users must set up an RSA Key. 
Next, these steps should be carried out within the user session to execute a SQL query.</p><pre># Set import libraries<br>import os<br>import snowflake.connector<br>import pandas as pd<br>from cryptography.hazmat.backends import default_backend<br>from cryptography.hazmat.primitives import serialization<br> <br>os.environ[&#39;NO_PROXY&#39;] = os.environ[&#39;no_proxy&#39;] = &#39;.snowflakecomputing.com&#39;<br><br>passphrase = get_credentials(&quot; &quot;, &quot;clientEnv&quot;)[&#39;rsa_passphrase&#39;]<br>#Add the path to your key<br>user = &#39;enter username&#39;<br>acct = &#39;enter URL&#39;<br> <br>with open(&quot; &quot;, &quot;rb&quot;) as key:  # path to key<br>    p_key = serialization.load_pem_private_key(<br>        key.read(),<br>        password=passphrase.encode(),<br>        backend=default_backend()<br>    )<br> <br>pkb = p_key.private_bytes(<br>    encoding=serialization.Encoding.DER,<br>    format=serialization.PrivateFormat.PKCS8,<br>    encryption_algorithm=serialization.NoEncryption())        <br> <br>     <br>ctx_read = snowflake.connector.connect(<br>    user=user,<br>    account=f&#39;blackrock-{acct}&#39;, <br>    private_key=pkb,<br>    role=&#39;ADC_ADMIN_ROLE&#39;,<br>    warehouse=&#39;CLI_STUDIO_WH&#39;,<br>    database=&#39;ALADDINDB&#39;,<br>    schema=&#39;INVESTMENTS&#39;<br>)<br> <br>df_pos = pd.read_sql(&quot;select * from pos_analytics LIMIT 10&quot;, ctx_read)</pre><p>Immediately, we can identify the following pain points:</p><ul><li>Verbose setup and configuration</li><li>Users are required to provide sensitive information such as their credentials/RSA token in code</li><li>Users are required to be aware of the account value (which might change in the event of a DR scenario)</li></ul><p>The same query with the Aladdin SDK is much simpler:</p><ul><li>The Aladdin SDK can infer most of the connection parameters from the ‘defaultWebServer’, but users are free to override all default configurations using one of the methods discussed above</li><li>Authentication is done using OAuth under the hood, by making an API call to Aladdin’s TokenAPI</li></ul><pre>from aladdinsdk.adc import ADCClient<br> <br>adc_client = ADCClient()<br> <br>df_pos = adc_client.query_sql(&quot;select * from pos_analytics LIMIT 10&quot;)</pre><p>The above examples provide just a glimpse of the abstractions and utilities built into the Aladdin SDK. For more detailed documentation of each feature, please read the <a href="https://github.com/blackrock/aladdinsdk/blob/main/README.md">Aladdin SDK README.md</a></p><h3>Building the Foundation for Domain-Specific SDKs</h3><p>The core Aladdin SDK is designed with extensibility in mind. It provides a foundational structure and allows for the development of domain-specific SDKs tailored to unique quantitative needs and investment workflows. This modular approach allows the community to build on the existing framework, creating a rich ecosystem of specialized tools that cater to specific financial domains such as algorithmic trading, risk management, or portfolio optimization.</p><p>Domain-specific SDKs also have access to the same configuration management tools as the core Aladdin SDK, so developers are encouraged to implement their features using <a href="https://12factor.net/config">configurations</a>. These SDKs can have their own configuration settings for fine-tuning feature-set provided by the SDK.</p><h4>Open Sourcing Aladdin SDK</h4><p>BlackRock relies heavily on open-source software. 
With the intention of giving back to the open-source community, and inviting the developers to contribute to and build on top of this project, Aladdin SDK is open-sourced and available at <a href="https://github.com/blackrock/aladdinsdk">https://github.com/blackrock/aladdinsdk</a>.</p><h4>Conclusion</h4><p>The Aladdin SDK offers a standardized, user-friendly, and extensible solution for Python developers to get started building faster. By empowering citizen developers and providing a solid foundation for the creation of domain-specific SDKs, it aims to reshape the landscape of quantitative finance applications. As the financial industry continues to evolve, the Aladdin SDK demonstrates our commitment to the power of open, collaborative, and innovative solutions in the quest for a quantitative edge.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7f63376061e6" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/open-sourcing-the-aladdinsdk-empower-python-developers-with-a-quantitative-edge-7f63376061e6">Open Sourcing the Aladdin SDK: Empowering Python Developers with a Quantitative Edge</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Think Like an Octopus (in Python)]]></title>
            <link>https://engineering.blackrock.com/think-like-an-octopus-in-python-a4beb65e04e3?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/a4beb65e04e3</guid>
            <category><![CDATA[numpy]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[financial-services]]></category>
            <category><![CDATA[quantitative-finance]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Thu, 21 Mar 2024 13:07:24 GMT</pubDate>
            <atom:updated>2024-03-21T13:07:24.168Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bZYCUHvJ9MVVHZAgaVzeyg.jpeg" /></figure><h4>An octopus can move its tentacles in parallel. Michael Ziegltrum shows different tactics to enable your Python programs to move similarly.</h4><blockquote>By: <strong>Michael Ziegltrum</strong>, CFA, Vice President in Core Portfolio Management Architecture team</blockquote><p>Pirates of the Caribbean 2: Dead Man’s Chest had some memorable scenes. One that stood out featured the pirate captain Davy Jones <a href="https://www.youtube.com/watch?v=L0JbaZtoKAs">playing his organ</a>. Jones is part octopus-part man and in the scene his tentacles moved synchronously over the instrument playing chords. Each appendage worked together to deliver something bigger. What if our Python programs could do similarly?</p><p>In fixed income portfolio management, there are many interesting problems where we could apply these ideas. In exchange traded funds (ETFs) portfolio managers (PMs) often evaluate a list of orders and compare them to the portfolio. For example, a PM might look at a list of orders and see what percent of the bonds are from companies in the banking sector. She might compare this metric in the list of orders to the portfolio’s allocation to banking. This is one metric, but the comparisons needed could be extensive. Hence the evaluation is a non-trivial piece of computation. As the scale increases via number of sectors, bonds, and portfolios, the time and memory usage to calculate metrics on an order list could be a problem. One idea is to evaluate multiple metrics at once like an octopus moving its tentacles in parallel. There are many different options towards this in Python including libraries like threading, multiprocessing, asyncio, ctypes, numpy, and more, each with varying trade-offs.</p><p>This post supplements an earlier post, <a href="https://engineering.blackrock.com/citizen-developer-cookbook-python-multiprocessing-3dc3c8cab29a">“Citizen Developer Cookbook: Python Multiprocessing”</a>, from a colleague, Casey Clements, who wrote up some in-depth recipes focusing on aggregation, monitoring, and exception handling with multiprocessing. In this post, we’ll ignore those important topics having been well-covered in Casey’s excellent post.</p><h3>Executive Summary</h3><p>This blog post looks at various options for calculating the market value percent of a sector in an ETF across many funds and sectors. This includes: the Python <a href="https://docs.python.org/3.9/library/threading.html">threading</a> API; the Python <a href="https://docs.python.org/3.9/library/multiprocessing.html">multiprocessing</a> API; a second approach to using the Python multiprocessing API; writing the processing in C++ and calling from Python; and using <a href="https://numpy.org/doc/stable/">numpy’s</a> vectorization. There are various trade-offs between these methods including simplicity, CPU and memory usage, runtime, and code safeness. 
Any production usage would need to weigh these trade-offs, but, in this toy problem, numpy is the winner.</p><p>Many of these trade-offs will not come as a surprise to savvy readers but in short: Python threading can only use one cpu; starting a new process using Python multiprocessing is heavy-weight and relatively slow vs starting a thread; we can use more than one cpu by writing C++ code and calling it from Python, with more complex and risky code; or we can leverage numpy which has its own C wrappers and vectorization that is probably better than what we would write.<br> <br>Now, let’s get into it!</p><h3>Problem Setup</h3><p>Let’s say we want to write a program to take a list of orders and calculate how each sector’s market value compares to a given target.<br> <br> Let’s model a portfolio and sector as per the below. This corresponds to a spreadsheet where each row is a bond and there are columns for market value percent and sector. If a bond is worth $5, in a portfolio valued at $100, that bond has a 5% market value percent. If that bond and another bond with market value of 2% are the only two bonds from companies in the banking sector, the banking sector allocation in the portfolio is 7%. In a real portfolio we would probably need to store identifiers for the bonds as well as other pieces of info, but we’ll ignore it in this toy example.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*KN9lGG0HCkHCCPkmY-nBMQ.png" /></figure><p>If you want a runnable script to follow along, a link is included at the bottom of this post.</p><pre>@dataclass<br>class Portfolio:<br>    &quot;&quot;&quot;<br>    Container class to keep track of market values weights and sectors for a portfolio.<br>    &quot;&quot;&quot;<br>    market_value_pcts: &quot;np.ndarray&quot;<br>    sectors: &quot;np.ndarray&quot;<br> <br> <br>class Sector(Enum):<br>    &quot;&quot;&quot;<br>    Enumeration to store sector information for bonds in a portfolio.<br>    &quot;&quot;&quot;<br>    BANKING = 0<br>    BASIC_INDUSTRY = 1<br>    BROKERAGE_ASSET_MANAGERS = 2<br>    CAPITAL_GOODS = 3<br>    COMMUNICATIONS = 4<br>    CONSUMER_CYCLICAL = 5<br>    CONSUMER_NON_CYCLICAL = 6<br>    ELECTRIC = 7<br>    ENERGY = 8<br>    FINANCE_COMPANIES = 9<br>    FOREIGN_AGENCIES = 10<br>    INSURANCE = 11<br>    NATURAL_GAS = 12<br>    OTHER_INDUSTRY = 13<br>    OTHER_UTILITY = 14<br>    REIT = 15<br>    TECHNOLOGY = 16<br>    TRANSPORTATION = 17</pre><p>We can instantiate some mock data with the below. Note that for each bond we store a 64-bit float and a 32-bit int. So, if we have 50,000 bonds, we store about 50,000 * (64 + 32) = 4,800,000 bits, or about 0.5MB. If we have 20 portfolios, this leads us to 10MB. On some Linux varieties, the limit to memory that we can use on the stack as local variables is 8MB, meaning for this data structure we would need to allocate memory from the heap. 
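A quick back-of-the-envelope check of that arithmetic, as a purely illustrative sketch (one 64-bit float and one 32-bit int per bond, ignoring Python object and numpy array overhead), lands in the same ballpark:</p><pre>num_bonds = 50_000
num_portfolios = 20

# one float64 market value weight (8 bytes) + one int32 sector (4 bytes) per bond
bytes_per_bond = 8 + 4
per_portfolio_mib = num_bonds * bytes_per_bond / 2**20
total_mib = num_portfolios * per_portfolio_mib

print(f&#39;{per_portfolio_mib:.2f} MiB per portfolio&#39;)   # ~0.57 MiB
print(f&#39;{total_mib:.1f} MiB across 20 portfolios&#39;)     # ~11.4 MiB</pre><p>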
As such, this is a non-trivial chunk of data.</p><pre>def prepare_mock_portfolio(num_bonds: int) -&gt; Portfolio:<br>    &quot;&quot;&quot;<br>    Helper function to prepare some fake portfolio data<br>    @param num_bonds: how many bonds to put in the fake portfolio<br>    @return:<br>    &quot;&quot;&quot;<br>    market_value_pcts = np.random.uniform(size=(num_bonds,)).astype(np.float64)<br>    market_value_pcts /= np.sum(market_value_pcts)<br>    sectors = np.random.randint(low=0, high=len(Sector), size=(num_bonds,), dtype=np.int32)<br>    portfolio = Portfolio(market_value_pcts, sectors)<br>    return portfolio<br> <br> <br>num_bonds = 50000<br>num_portfolios = 20<br>portfolios = []<br>for i in range(num_portfolios):<br>    portfolios.append(prepare_mock_portfolio(num_bonds))</pre><p>Let’s assume we want to calculate the sector weights vs targets for each portfolio, though in reality we may have more complex rules:</p><pre>def log_tolerance(worker_num: int, sector: Sector, market_value_pct: float, target_weight: float):<br>    print(<br>        f&quot;worker: {worker_num} {sector.name}: {market_value_pct:.3f}, target_weight: {target_weight:.3f}, &quot;<br>        f&quot;diff_from_tol: {(market_value_pct - target_weight):.3f}\n&quot;)<br> <br> <br>def calculate_sector_weights(worker_num: int, portfolio: Portfolio, target_weight: float):<br>    &quot;&quot;&quot;<br>    Function to calculate how much market value is in each sector and log this for a given portfolio.<br>    @param worker_num: which worker is responsible for the calculation<br>    @param portfolio: portfolio container<br>    @param target_weight: a target weight to compare against market value pct<br>    @return:<br>    &quot;&quot;&quot;<br>    sector_weights = [0.0] * len(Sector)<br> <br>    for i in range(portfolio.sectors.shape[0]):<br>        bond_sector = portfolio.sectors[i]<br>        sector_weights[bond_sector] += portfolio.market_value_pcts[i]<br> <br>    for i in range(len(Sector)):<br>        sector = Sector(i)<br>        mv_pct = sector_weights[i]<br>        log_tolerance(worker_num, sector, mv_pct, target_weight)<br>    return</pre><h3>Examining Approaches</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m4RgMmemK8XJCi7F4ppBow.jpeg" /></figure><p>All of the below approaches were run on Linux (Debian bullseye) and Python 3.9.18. Example output looks like the below:</p><pre>...<br>worker: 0 TECHNOLOGY: 0.054, target_weight: 0.056, diff_from_tol: -0.002<br> <br>worker: 0 TRANSPORTATION: 0.055, target_weight: 0.056, diff_from_tol: -0.000<br>...</pre><h3>Iterative Approach</h3><p>As we look at ways to solve this problem, the first thing that may come to mind is the iterative approach:</p><pre>def iterative_calcs(portfolios: List[Portfolio], target_weight: float):<br>    &quot;&quot;&quot;<br>    Iteratively calculate and log the market value weights for each sector in each portfolio.<br>    @param portfolios: list of Portfolio instances<br>    @param target_weight: target sector weight to compare actual weight to<br>    @return:<br>    &quot;&quot;&quot;<br>    for pf in portfolios:<br>        calculate_sector_weights(0, pf, target_weight)</pre><p>This runs in about 0.54 seconds and during the run the most virtual memory used was 474MB. The most CPU used was 100%. The relevant PID is 94 — I’m using a wrapper to run and profile these (PID 82 in the below). 
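A minimal sketch of such a wrapper, assuming psutil for the resource sampling (an approximation for illustration, not the exact script behind the numbers quoted here), might look like this:</p><pre>import subprocess
import sys

import psutil

# Sketch of a profiling wrapper: run one approach in a child Python process
# and sample its CPU and virtual memory usage until it exits.
def profile_run(script_path: str, method: str, sample_secs: float = 0.05):
    proc = subprocess.Popen([sys.executable, script_path, method])
    ps = psutil.Process(proc.pid)
    peak_vms_bytes, peak_cpu_pct = 0, 0.0
    while proc.poll() is None:  # keep sampling while the child is running
        try:
            peak_vms_bytes = max(peak_vms_bytes, ps.memory_info().vms)
            peak_cpu_pct = max(peak_cpu_pct, ps.cpu_percent(interval=sample_secs))
        except psutil.NoSuchProcess:
            break
    print(f&#39;{method}: peak CPU {peak_cpu_pct:.0f}%, peak memory {peak_vms_bytes / 2**20:.0f}MB&#39;)</pre><p>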
This wrapper starts up a subprocess to run the sector calculations according to a given methodology like threading or multiprocessing and it watches the resource usage as it does so.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*Vc-MM3Uv8S1Zp3EyW1rNgQ.png" /></figure><p>Can we do better than this?</p><h3>Threading Approach</h3><p>In a lot of programming languages threads let us do more than one thing at a time while sharing the same memory. Could we evaluate more than one rule at once using <a href="https://docs.python.org/3.9/library/threading.html">threading</a>?</p><pre>def threading_calcs(num_workers: int, portfolios: List[Portfolio], target_weight: float):<br>    &quot;&quot;&quot;<br>    Calculate and log the market value weights for each sector in each portfolio using one thread per portfolio.<br>    @param num_workers: how many threads to use at a given time. for example, 3.<br>    @param portfolios: list of Portfolio instances<br>    @param target_weight: target sector weight to compare actual weight to<br>    @return:<br>    &quot;&quot;&quot;<br>    for batch_start in range(0, len(portfolios), num_workers):<br>        worker_list = []<br>        for thread_index in range(num_workers):<br>            pf_index = batch_start + thread_index<br>            if pf_index &gt;= len(portfolios):<br>                continue<br>            args = (thread_index, portfolios[pf_index], target_weight)<br>            worker = threading.Thread(target=calculate_sector_weights, args=args)<br>            worker_list.append(worker)<br>            worker.start()<br>        [worker.join() for worker in worker_list]</pre><p>This runs in about 0.52 seconds and during the run the most virtual memory used was about 667MB. The most CPU used was 113%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*0fCcE9k1VHrYxyzG3zGHYQ.png" /></figure><p>Compared to the iterative approach we use 200MB more memory. We have improved run time by only 0.02 seconds, though. We may have expected to cut run time down to a third, given we use 3 workers. What gives?<br> <br> The key is in our CPU usage which is also around 100%. If we were truly doing multiple things at a time, this would have been 300% (representing using 100% of 3 CPU cores). Python, as a language, though, has many built-in features to protect Python programmers from people like themselves.</p><p>One such mechanism is the global interpreter lock (GIL) which makes it so that only one thread can access the interpreter and do something at a given point in time. Many third-party libraries like numpy have C code that circumvents this, but if we are using pure Python we cannot get around this lock.</p><p>In future versions of Python, we may escape the GIL (see discussion <a href="https://discuss.python.org/t/a-steering-council-notice-about-pep-703-making-the-global-interpreter-lock-optional-in-cpython/30474">here</a>), but, for now, threading is not a good fit for CPU bound tasks like this one. It <em>may</em> have been a better fit if we had moved some of these calculations to a microservice that we could then call via an HTTP or other remote call.<br> <br> So, why was the CPU usage slightly over 100%? Probably the thread switching, which may use more than 1 CPU, but I leave this for discussion.</p><h3>Multiprocessing Approach</h3><p>There is another Python library that can do more than one thing at a time. The <a href="https://docs.python.org/3.9/library/multiprocessing.html">multiprocessing</a> library starts up processes.
Processes are heavier weight than threads and do not share memory space, so we will have to worry about passing data back and forth between processes. That said, the API was designed to be similar to the threading API so an initial implementation might look like the below:</p><pre>def multiprocessing_calcs_bad(num_workers: int, portfolios: List[Portfolio], target_weight: float):<br>    &quot;&quot;&quot;<br>    Calculate and log the market value weights for each sector in each portfolio using one process per portfolio.<br>    @param num_workers: how many processes to use at a given time. for example, 3.<br>    @param portfolios: list of Portfolio instances<br>    @param target_weight: target sector weight to compare actual weight to<br>    @return:<br>    &quot;&quot;&quot;<br>    for batch_start in range(0, len(portfolios), num_workers):<br>        worker_list = []<br>        for thread_index in range(num_workers):<br>            pf_index = batch_start + thread_index<br>            if pf_index &gt;= len(portfolios):<br>                continue<br>            args = (thread_index, portfolios[pf_index], target_weight)<br>            worker = multiprocessing.Process(target=calculate_sector_weights, args=args)<br>            worker_list.append(worker)<br>            worker.start()<br>        [worker.join() for worker in worker_list]</pre><p>This runs in about 0.23 seconds, a bit better than half the run time of the iterative or threading approaches. Why is this? We can now do more than one thing at a time, like an octopus playing a church organ.</p><p>In the screenshot below we can see that we have spawned more Python processes, and each of those processes (PID 351, 352, 353) can use up to a full CPU. As an aside, in the screenshot we only see 6.7%. This is because I took the screenshot at the wrong time. In reality they probably used 100% CPU for a brief instant then exited.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*zQHsz9B_FFtCWrJDpoXd-g.png" /></figure><p>However, this is at the cost of using significantly more memory (perhaps 1.5 gigabytes more) than prior approaches because each process has to have its own Python setup and its own copy of the large mock data that we are working with.<br> <br> Another drawback is that we again may have expected to cut down run time to a third of the iterative approach given we use three workers, but only cut it in half. What gives?<br> <br> Well, starting up a process is an intensive thing, and in the above code we start up one process for each portfolio. Additionally, sending data, like our portfolio list, to the new process can take time.<br> <br>Can we do better?</p><h3>Multiprocessing Approach With Fixed Workers</h3><p>With what I admit is a non-trivial amount of new code, we can bring up 3 workers and leave them up throughout the program, sending jobs to them. This should save process start up time. We also will copy the portfolio list just once to each of the three processes, rather than sending it once per portfolio.
In this example I don’t expect this latter bit to save a lot of time here but it could be more relevant for you depending on size and structure of your data.</p><pre>@dataclass<br>class MPCalcSectorWeightArgs:<br>    &quot;&quot;&quot;<br>    Container class for arguments to a multiprocessing worker to calculate portfolio weights.<br>    &quot;&quot;&quot;<br>    exit_flag: bool<br>    pf_index: int<br> <br> <br>def calc_sector_weights_wrapper(args_queue: Queue, worker_num: int, portfolios: List[Portfolio], target_weight: float):<br>    &quot;&quot;&quot;<br>    Helper function to bring up a worker and keep it up looking for new jobs to run in the queue. Will exit when it<br>    receives a job with an exit flag set in the queue.<br>    @param args_queue: a queue to poll for jobs or to exit<br>    @param worker_num: which worker this is<br>    @param portfolios: list of Portfolio instances<br>    @param target_weight: target sector weight to compare actual weight to<br>    @return:<br>    &quot;&quot;&quot;<br>    exit_flag = False<br>    while not exit_flag:<br>        if args_queue.qsize() &gt; 0:<br>            args = args_queue.get()<br>            pf_index = args.pf_index<br>            exit_flag = args.exit_flag<br>            if exit_flag:<br>                continue<br>            calculate_sector_weights(worker_num, portfolios[pf_index], target_weight)<br> <br> <br>def multiprocessing_calcs(num_workers: int, portfolios: List[Portfolio], target_weight: float):<br>    &quot;&quot;&quot;<br>    Calculate and log the market value weights for each sector in each portfolio using the given number of workers and<br>    keeping these processes up throughout the run to save on process start up time.<br>    @param num_workers: how many threads to use at a given time. for example, 3.<br>    @param portfolios: list of Portfolio instances<br>    @param target_weight: target sector weight to compare actual weight to<br>    &quot;&quot;&quot;<br>    worker_list = []<br>    q_list = []<br>    for i in range(num_workers):<br>        q = multiprocessing.Queue()<br>        p = multiprocessing.Process(target=calc_sector_weights_wrapper, args=(q, i, portfolios, target_weight))<br>        worker_list.append(p)<br>        q_list.append(q)<br>    [p.start() for p in worker_list]<br>    for pf_index in range(0, len(portfolios)):<br>        args = MPCalcSectorWeightArgs(False, pf_index)<br>        worker_num = pf_index % num_workers<br>        q_list[worker_num].put(args)<br>    for worker_num in range(num_workers):<br>        args = MPCalcSectorWeightArgs(True, 0)<br>        q_list[worker_num].put(args)<br>    [worker.join() for worker in worker_list]</pre><p>How did we do? Runtime is now around 0.20 seconds. We saved 13% versus the prior implementation. This is 0.03 seconds which may not be significant in the scheme of the 2–3 years of a typical blue-ringed octopus’ life, but is a significant improvement in the computer world!</p><p>On memory, we did not improve nor did we expect to. With regards to CPU usage, as the processes below stay up longer than the processes in the prior example (given we only have 3 and we are sending jobs to them and keeping them alive), we see higher CPU usage. 
That said, this is only reflecting the fact that the processes are up longer and I have time to get the output from the top command.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*-jxzW2hoRjUeqQaeUqIZzA.png" /></figure><p>If we wanted to push start up time more, the multiprocessing API supports other ways to pass data to processes and we might find that some are faster than others (like using shared memory).</p><h3>C++ Wrappers and Recreating the Wheel</h3><p>We are doing quite a lot of work here to get around Python’s limitations on threading. Specifically, we are starting processes, passing data to them, and so on. But, Python has good integration with C and C++, and we could simply write our heavy-duty processing in C++ instead and use threading there, to avoid the Python global interpreter lock. We can create Python bindings for this using the library <a href="https://docs.python.org/3.9/library/ctypes.html">ctypes</a>. This is a lot riskier, but let’s look at an example.<br> <br> In the below we set up one function in C++ to start threads and call the second function which calculates and writes the result to an existing output array.</p><pre>/**<br> * Looks at the input pf_num argument and writes the sector weights of that portfolio to the portfolio_sector_weights argument.<br> * No size checking is done on the input pointers, so use this function with great caution. See argument descriptions<br> * for input pointer structure.<br> *<br> * @param  pf_num which portfolio to calculate, used to move to relevant section of memory in below pointers.<br> * @param  portfolio_bond_weights  pointer of size num_portfolios * num_bonds, flattened. Each entry is the market value percent of a bond.<br> * @param  portfolio_bond_sectors pointer of size num_portfolios * num_bonds, flattened. Each entry is the sector of a bond in integer representation.<br> * @param  num_bonds how many bonds per portfolio, assumed to be the same for each portfolio. Used to move to relevant memory section.<br> * @param  portfolio_sector_weights the output array to write portfolio sector weights to. Size of num_portfolios * num_sectors.<br> * @param  num_sectors the number of sectors in the sector scheme. Used to move to relevant section of memory in portfolio_sector_weights.<br> * @return      void<br> * @see         calculate_sector_weights<br> */<br>void worker_calculate_sector_weights(const uintmax_t pf_num, const float64_t *portfolio_bond_weights, const int32_t *portfolio_bond_sectors,<br>                                     const uintmax_t num_bonds,<br>                                     float64_t *portfolio_sector_weights, const uintmax_t num_sectors)<br>{<br>    const uintmax_t starting_out_index = pf_num * num_sectors;<br>    for (uintmax_t i = 0; i &lt; num_sectors; i++)<br>    {<br>        portfolio_sector_weights[starting_out_index + i] = 0.0;<br>    }<br> <br>    const uintmax_t starting_bond_index = pf_num * num_bonds;<br>    for (uintmax_t bond_index = starting_bond_index; bond_index &lt; starting_bond_index + num_bonds; bond_index++)<br>    {<br>        const int32_t &amp;sector = portfolio_bond_sectors[bond_index];<br>        const float64_t &amp;market_value_pct = portfolio_bond_weights[bond_index];<br>        portfolio_sector_weights[starting_out_index + sector] += market_value_pct;<br>    }<br>}<br> <br>extern &quot;C&quot;<br>{<br>    /**<br>     * Batches out calculations of sector weights in portfolios to threads. Uses a max of three threads at a time. 
Writes the sector weights to the<br>     *  portfolio_sector_weights argument. No size checking is done on the input pointers, so use this function with great caution.<br>     *  See argument descriptions for input pointer structure.<br>     *<br>     * @param  portfolio_bond_weights  pointer of size num_portfolios * num_bonds, flattened. Each entry is the market value percent of a bond.<br>     * @param  pf_size how many portfolios in input pointers, used to move to relevant section of memory in below pointers.<br>     * @param  portfolio_bond_sectors pointer of size num_portfolios * num_bonds, flattened. Each entry is the sector of a bond in integer representation.<br>     * @param  num_bonds how many bonds per portfolio, assumed to be the same for each portfolio. Used to move to relevant memory section.<br>     * @param  portfolio_sector_weights the output array to write portfolio sector weights to. Size of num_portfolios * num_sectors.<br>     * @param  num_sectors the number of sectors in the sector scheme. Used to move to relevant section of memory in portfolio_sector_weights.<br>     * @return      void<br>     */<br>    void calculate_sector_weights(const float64_t *portfolios_bond_weights, const uintmax_t pf_size,<br>                                  const int32_t *portfolios_bond_sectors, const uintmax_t num_bonds,<br>                                  float64_t *portfolios_sector_weights, const uintmax_t num_sectors)<br>    {<br>        const uintmax_t num_workers = 3;<br>        std::thread thread_arr[num_workers];<br> <br>        for (uintmax_t i = 0; i &lt; pf_size; i++)<br>        {<br>            if (i % num_workers != 0)<br>            {<br>                continue;<br>            }<br>            for (uintmax_t j = 0; j &lt; num_workers; j++)<br>            {<br>                uintmax_t pf_num = i + j;<br> <br>                if (pf_num &gt;= pf_size)<br>                {<br>                    continue;<br>                }<br>                thread_arr[j] = std::thread(worker_calculate_sector_weights, pf_num, portfolios_bond_weights, portfolios_bond_sectors,<br>                                            num_bonds, portfolios_sector_weights, num_sectors);<br>            }<br>            for (uintmax_t j = 0; j &lt; num_workers; j++)<br>            {<br>                if (i + j &gt;= pf_size)<br>                {<br>                    continue;<br>                }<br>                thread_arr[j].join();<br>            }<br>        }<br>    }<br>}</pre><p>We can load this into Python by using the ctypes library and create a wrapper around it to do some argument validation. Then we can call this function as we’ve called previous functions.</p><pre>def setup_cpp_lib():<br>  &quot;&quot;&quot;<br>  Helper function to load in the shared library object for the C++ functions and set some basic argument validation<br>   on the functions in the library. 
Relies on the .so file being generated and in the working directory<br>   of the python file.<br>  @return:<br>  &quot;&quot;&quot;<br>    cpp_lib = ctypes.cdll.LoadLibrary(&quot;./pf_sector_helpers.so&quot;)<br>    calculate_sector_weights = cpp_lib.calculate_sector_weights<br>    calculate_sector_weights.restype = None<br> <br>    calculate_sector_weights.argtypes = [ndpointer(ctypes.c_double, flags=&quot;C_CONTIGUOUS&quot;), ctypes.c_size_t,<br>                                         ndpointer(ctypes.c_int32, flags=&quot;C_CONTIGUOUS&quot;), ctypes.c_size_t,<br>                                         ndpointer(ctypes.c_double, flags=&quot;C_CONTIGUOUS&quot;), ctypes.c_size_t]<br>    return cpp_lib<br> <br> <br>def cpp_calc_sector_weights(portfolios: List[Portfolio], target_weight: float, cpp_lib):<br>  &quot;&quot;&quot;<br>  Python wrapper function for the C++ function to calculate sector weights. Calculates and logs the market value<br>   weights for each sector in each portfolio using one thread per portfolio.<br>  @param portfolios: list of Portfolio instances<br>  @param target_weight: target sector weight to compare actual weight to<br>  @param cpp_lib: shared library loaded using setup_cpp_lib<br>  @return:<br>  &quot;&quot;&quot;<br>    sector_array = np.concatenate([pf.sectors for pf in portfolios], axis=None, dtype=np.int32)<br>    mv_arrays = np.concatenate([pf.market_value_pcts for pf in portfolios], axis=None, dtype=np.float64)<br> <br>    if sector_array.shape != mv_arrays.shape:<br>        raise ValueError(&quot;sectors must have same number of entries as bonds&quot;)<br> <br>    num_portfolios = len(portfolios)<br>    num_sectors = len(Sector)<br>    num_bonds = portfolios[0].market_value_pcts.shape[0]<br>    out_arr = np.zeros((num_portfolios * num_sectors,), dtype=np.float64)<br> <br>    cpp_lib.calculate_sector_weights(mv_arrays, num_portfolios, sector_array, num_bonds, out_arr, num_sectors)<br> <br>    for pf_index in range(num_portfolios):<br>        row_index = pf_index * num_sectors<br>        for i in range(num_sectors):<br>            sector = Sector(i)<br>            log_tolerance(0, sector, out_arr[row_index + i], target_weight)</pre><p>This runs in about 0.06 seconds, a <em>substantial</em> improvement from our prior methods. In fact, I had to use a new method to examine CPU and memory usage because it ran too quickly for my prior method. We see CPU Usage between 200% and 400% at various times in the script. I am running on a computer with 6 Cores and 12 logical processors. Using 3 threads in C++ I would expect up to 300% in top used. In reality, I am measuring at a point in time and threads could be spinning up or down, so I don’t see exactly these numbers. That said, usage over 100% indicates we are doing more than one thing at a time. C++, unlike Python, does not have a global interpreter lock (GIL), though production multithreaded code employs various locks and synchronization techniques to protect shared variables and such.<br> <br> Memory usage is similar to the iterative method, given we are still using Python to call the C++ code. As an aside, at one point in the program CPU usage was 1200% (using each of my 12 logical processors) but this was numpy operations to make mock data and we will talk about numpy shortly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*lepml5b5t23e6sq9_s5C-A.png" /></figure><p>The C++ example achieves performance gains, but at the cost of significantly more complex code. 
Note that we had to work with pointers opening the door to segmentation faults and I didn’t do as much argument validation as I should have. I also played pretty fast and loose with shared resources in the threads. This is risky, there could be memory related bugs lurking just below the surface!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SErVln7zLjWpvoYvFUBPdQ.jpeg" /></figure><p>I wonder if someone already done an implementation of these calculations in C with some controls that also can take advantage of parallel computing?</p><h3>numpy</h3><p>The answer is yes, <a href="https://numpy.org/doc/stable/">numpy</a> has done a lot of work to enable fast vectorized calculations that take advantage of C and parallel processing. It also has a lot more controls and safeguards in place than the code I wrote above.</p><pre>def numpy_calc_sector_weights(portfolios: List[Portfolio], target_weight: float):<br>  &quot;&quot;&quot;<br>  Calculates and logs the market value weights for each sector in each portfolio using numpy to achieve parallel<br>   processing.<br>  @param portfolios: list of Portfolio instances<br>  @param target_weight: target sector weight to compare actual weight to<br>  @return:<br>  &quot;&quot;&quot;<br>    sector_array = np.array([pf.sectors for pf in portfolios], dtype=np.int32)<br>    mv_arrays = np.array([pf.market_value_pcts for pf in portfolios], dtype=np.float64)<br>    num_portfolios = len(portfolios)<br>    num_sectors = len(Sector)<br>    out_arr = np.zeros((num_portfolios, num_sectors,), dtype=np.float64)<br> <br>    for sector_index in range(num_sectors):<br>        sector_mask = np.equal(sector_array, sector_index)<br>        sector_sums = np.sum(mv_arrays, axis=1, where=sector_mask)<br>        out_arr[:, sector_index] = sector_sums<br> <br>    for pf_index in range(num_portfolios):<br>        for i in range(num_sectors):<br>            sector = Sector(i)<br>            log_tolerance(0, sector, out_arr[pf_index, i], target_weight)</pre><p>This runs in about 0.08 seconds, about the same as our self-written method. Notably, CPU usage is higher than in the C++ code I wrote, probably because numpy has different default limits set on the number of cores it can use at once (I set number of threads to 3 in my C++ example). That said, it was tricky for me to identify the CPU usage at the point in the function that does the sector calculations and so how much higher I’m not sure about. I’m sure we could set limits on this if we wanted to, even via docker if needed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*dgS_FtLKb8T1A9JXDc5uOA.png" /></figure><p>Other folks may be able to come up with a solution that leverages numpy to vectorize over sectors as well, rendering the for loop over sectors unnecessary and parallelized.<br> <br>Regarding controls and safeguards, in numpy if I try to access an element outside an array, I get an error. In the C++ code I wrote above I do not check sizes and do validation. Instead, I parse memory addresses to a double. If I access an out-of-bounds element due to bad input or a bug, I will read memory I didn’t intend to. Either this memory doesn’t belong to me and I will cause a seg fault or I will silently read something I didn’t intend to. Either are poor outcomes. 
Hence, the validation numpy does is useful.</p><pre>Python 3.9.18 (main, Feb  1 2024, 06:03:49)<br>[GCC 10.2.1 20210110] on linux<br>Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.<br>&gt;&gt;&gt; import numpy as np<br>&gt;&gt;&gt; arr = np.array([1, 2])<br>&gt;&gt;&gt; arr[10]<br>Traceback (most recent call last):<br>  File &quot;&lt;stdin&gt;&quot;, line 1, in &lt;module&gt;<br>IndexError: index 10 is out of bounds for axis 0 with size 2</pre><p>There are other examples of controls and safeguards numpy does for us that I don’t go into here.</p><h3>Third Party Libraries</h3><p>There are other libraries for parallel computing in Python. My colleague, Casey Clements, who wrote the blog post that I mentioned <a href="https://engineering.blackrock.com/citizen-developer-cookbook-python-multiprocessing-3dc3c8cab29a">here</a> briefly mentions Dask, Airflow, and Prefect.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*3MU4PlHX8Dkq6zZ4Pb047g.png" /></figure><h3>Conclusion</h3><p>Rather like an octopus trying to ambush a clown fish in a coral reef, there are many ways to approach parallel computing in Python. If I was reviewing a pull request implementing parallel computing (or ideally brainstorming with colleagues <em>before</em> a pull request ever came out) there are a few factors I would weigh:</p><ul><li>What is our current tech stack? If we are mainly a Python team, I will be biased to look at Python libraries like numpy that handle implementations.</li><li>How many rules and how many bonds do we have? This will influence whether we prioritize approaches emphasizing memory, multiple CPUs, or other aspects. As the number of rules increases we get more benefit from the multiprocessing API, given we can amortize the process startup cost.</li><li>How fundamental and how scalable does the process need to be? For a library that we rely on extensively, that needs to run fast many millions of times a day, it may make sense to do a barebones C++ implementation.</li></ul><p>There are many other topics in threading that we didn’t discuss today like synchronization and deadlocks.<br> <br> Looking at the approaches in this blog, for this problem, numpy is the winner for me. This is because it has similar performance to barebones C++ but with more controls, safeguards, and features in place.<br> <br> I hope you all enjoyed this discussion. See full runnable source code examples in <a href="https://github.com/MZandtheRaspberryPi/think_like_an_octopus">this repo</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a4beb65e04e3" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/think-like-an-octopus-in-python-a4beb65e04e3">Think Like an Octopus (in Python)</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing InGen]]></title>
            <link>https://engineering.blackrock.com/introducing-ingen-cb6a5083788c?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/cb6a5083788c</guid>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pandas]]></category>
            <category><![CDATA[extract-transform-load]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Thu, 01 Feb 2024 13:35:20 GMT</pubDate>
            <atom:updated>2024-02-02T14:41:26.167Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*epOO4BKhE99la_Jn32ThFg.jpeg" /></figure><h4>An Open Source ETT (Extract, Transform, Transfer) Python Tool</h4><blockquote><em>By: </em><strong><em>Swarna Dhakad</em></strong><em>, Senior Engineering Team Director, Aladdin Wealth Tech, &amp; </em><strong><em>Piyush Ranjan</em></strong><em>, Engineer III, Aladdin Wealth Tech</em></blockquote><h4>From ETL to ELT to ETT</h4><p>Interfaces allow two different systems to exchange data between each other. An example of an interface is a file which contains data in rows and columns. Think of an Environmental, Social and Governance (ESG) data provider sending a daily feed of files containing security identifiers and their ESG scores to an asset manager. Overnight feeds of analytics data, custodial ingestion of accounts and positions are other examples where one system ‘interfaces’ with another by exchanging data in an agreed upon format. Even in the age of APIs and real-time, streaming data, such asynchronous data transfer is still an extremely prevalent use case in the fintech industry.</p><p>Therefore, a common requirement is to be able to generate these interfaces repeatedly and automatically. While the purpose of the data generated will vary as per the business logic, the fundamental operations needed to extract the data are typically within a finite set of data operations like reading data from different sources, performing some data massaging and writing it to its destination in a predefined format.</p><p>The financial industry, and specifically asset management, has focused a lot on the data integration problems and has built a lot of solutions for doing ETL (Extract, Transform, Load), ELT (Extract, Load, Transform) operations. Most of these solutions require high expertise and skills to use these tools, which requires investing a lot of time in learning and upskilling. Often, we do not need such high-end tools and operations and just want to extract data from some sources and write it in a format that can be ingested to the destination system. We call this pattern Extract, Transform, and Transfer (ETT). ETT can be thought of as extracting data from various sources, transforming it into the required format and sending it to another system through files or APIs.</p><p>This was a very common use case at BlackRock in the Aladdin Wealth Tech Business, where we integrate with several external sponsor platforms, custodians, and other systems within the firm. Most of these data exchanges are daily and happen overnight. In the spirit of not reinventing the wheel, we first used existing, available tools for several use cases. However, we began to feel that it would be better for us in the long term to separate the business logic of extraction from the process of extraction. It was time for a better wheel, perhaps?</p><h4>Introducing: InGen</h4><p>Our ideas led us to create a new solution, InGen, a Python-based command line tool that allows the user to generate interface files from various sources like databases, files, and HTTP APIs, without writing any new code. The process is completely config-driven, requiring only a configuration file in YAML format, which declares the data sources, the formatting operations to be applied on the data, validations, and the output format.</p><p>An important aspect of InGen is that the configuration files are easy to write and can be built and maintained by non-developers, as well. 
This allows for a much broader set of users that can create their own complex data extracts without needing to involve a software team or write a single line of code.</p><p>This tool has been built on top of <a href="https://pandas.pydata.org/docs/">Pandas</a> and <a href="https://greatexpectations.io/">Great Expectations</a> library. The data from the defined sources is read into a pandas dataframe and goes through a series of transformations as described in the pre-processing and formatting stages of the configuration file. The data transformations can be thought of as a pipeline which you construct.</p><p>Pre-processing steps combine data from multiple sources — SQL like joins, concatenation of multiple data frames or filtering of duplicates can be performed in this stage. At the end of this stage, we should have a single data frame which is then passed to the next stage where column level formatting is applied.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LfLBYGdiOllDWRh_OLt62w.png" /></figure><p>Once the data has gone through all these steps, the last step is to write it. The most common example is to write this data to a file in tabular structure. However, InGen also supports transforming data into JSON and pushing it to a web API.</p><h4>Real-World Example</h4><p>An example of where we used InGen was for a project to integrate one of our operational processes with an external system. We needed to pick up an Excel file sent to us, translate it into a specific JSON format that the external system can ingest, and do some validations along the way. This was solved by using InGen’s file reader, JSON formatter, API writer and source validations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7udPeGVZB9RyqExmvVwLzg.png" /></figure><p>Here, you can see the config which includes the data source, pre-processing steps on the source, and the desired output format.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/890/1*S7LvuQJeQugk_VBwkjtSkQ.png" /></figure><p>Below shows the destination configuration, which is where we want to send the data. Here, we configure the URL of the API that we are calling and the authentication method.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/705/1*Da3Q5I_Rzx3tYos1ppnJVg.png" /></figure><p>Finally, below shows the formatting steps where translation happens from the source to the destination. Here the column ‘Residence State’ in the source is being renamed to ‘State’. The ‘Portal Date’ column is being renamed to ‘Date’ in the destination as well as undergoing a date formatting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/564/1*UuRRDbibnSJJzTSAVg_dUw.png" /></figure><h4>Open Sourcing InGen</h4><p>BlackRock relies heavily on open-source software and increasingly aims to give back to the open-source community. 
To realize the full potential of this configurable command line tool we are making InGen an open-source project with two goals in mind:</p><ol><li>Allow others to use this tool for their interface generation processes.</li><li>Welcome the open-source community to contribute to and enhance this project.</li></ol><p>To use InGen, follow the guidelines in the <a href="https://github.com/blackrock/ingen#installation">Getting Started</a> section of the README and to contribute to this project checkout our <a href="https://github.com/blackrock/ingen/issues">open issues</a> or <a href="https://github.com/blackrock/ingen/blob/main/CONTRIBUTING.md">contribution guidelines</a>.</p><p>Learn more about <a href="https://www.blackrock.com/aladdin">Aladdin</a> and <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cb6a5083788c" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/introducing-ingen-cb6a5083788c">Introducing InGen</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How BlackRock’s hackathon empowers employees to be innovative]]></title>
            <link>https://engineering.blackrock.com/how-blackrocks-hackathon-empowers-employees-to-be-innovative-8ba3fb85a848?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/8ba3fb85a848</guid>
            <category><![CDATA[innovation]]></category>
            <category><![CDATA[engineering-mangement]]></category>
            <category><![CDATA[tech-culture]]></category>
            <category><![CDATA[corporate-culture]]></category>
            <category><![CDATA[hackathons]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Tue, 19 Dec 2023 21:42:07 GMT</pubDate>
            <atom:updated>2024-02-01T13:38:18.640Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*AwhkssV5KAbwAz4E_xs8iw.png" /></figure><blockquote><em>By: Our partners at </em><a href="https://www.themuse.com/profiles/blackrock"><em>The Muse</em></a></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/1*1rZ4JnINPlg14KZc5RP-iw.png" /></figure><p>Mael Pidjou was barely six months into his role as an Associate on our Aladdin Client Business team when a manager asked him if he might have an idea for HACK:BLK, the firm’s annual hackathon creates the opportunity for creative problem solving through technology. And despite being a fresh face at the firm, Mael had just the thing.</p><p>In his role, the Tokyo-based Mael works closely on <a href="https://www.blackrock.com/aladdin">Aladdin</a>, BlackRock’s investment management technology, often delving into the firm’s deep reservoirs of data to find pertinent information for clients. When generative forms of AI, such as ChatGPT, began to dominate the news this year, Mael wondered: What if this kind of machine learning could search the database more easily, freeing up more time to tackle new projects?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/0*Ks66YjcnXQpPLISW.jpg" /></figure><p><em>Mael Pidjou, Associate, Tokyo</em></p><p>“Almost every person I had discussed my problem with said it was a pain point for them, too. In general, everyone wants to find information faster,” Mael says.</p><p>That inspiration became the basis of Smart Sensei, the winning project of the 2023 edition of HACK:BLK. The hackathon, which will celebrate its 10th anniversary in 2024, invites employees to collaborate with colleagues from their region, including some they may never have met. Regional teams are given a short time to virtually plan a project and just two days of around-the-clock hacking to bring an idea to fruition. Then, finalists create video presentations to showcase their prototypes, and employees vote to select one winning project.</p><p>Mimi Narbonne, a Vice President on the Aladdin Engineering Chief Operating Officer team, says the hackathon is a perfect example of one of the firm’s guiding principles, ‘One BlackRock.’ The concept refers to the power of working collaboratively and building connections across the company — from engineering and communications to sales and legal — to get things done.</p><p>“The hackathon wouldn’t happen without the whole firm being involved because the purpose is to think innovatively, and we have to bring everyone together to do that,” Mimi says.</p><p>Mael’s 2023 hackathon team is an illustrative example. The team came together during a HACK:BLK meeting in the Asia-Pacific region, where Mael and other employees pitched their ideas and formed small groups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/0*fJlePS_Wld59vJwX.jpg" /></figure><p><em>Mimi Narbonne, Vice President, New York</em></p><p>There were two other members from the Tokyo office (which inspired Smart Sensei’s name) plus three colleagues from India, and everyone came from different teams within those offices. An Analyst had studied a little machine learning in college and was keen to see it applied in the real world. Two engineers were experts in machine learning and language models. 
And a Director who joined provided project management experience in taking an idea all the way from A to Z.</p><p>“The hackathon really proved to me that BlackRock has so many skillful people all over and that it’s possible to collaborate with different teams and different regions,” Mael says.</p><p>According to Mimi, HACK:BLK is a way to not only inspire grassroots ideas that could improve the firm’s operations, but also create a culture of innovation in which employees from every division — not just engineers — can hack, too. In fact, hackathon teams are often made up of employees with engineering skills as well as those who want to contribute in non-technical ways. The event also shows employees that smart solutions can come from anyone, rather than just top-down.</p><p>“We’re spreading our tech culture across the firm,” Mimi says. “We’re inviting everyone to bring their ideas, and then connecting them with engineers to take their ideas across the finish line because, in the end, you need a working prototype.”</p><p>“Most people on the technology side are well aware of what’s expected from a hackathon, but we want to encourage employees on the business side to get involved, too, and to think like technologists,” adds Jing Chen, a Director on the Aladdin Product Service Engineering team based in Atlanta and one of the six 2023 hackathon directors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/199/0*Op9QPTEfwSLm7qOh.jpeg" /></figure><p><em>Jing Chen, Director, Atlanta</em></p><p>In previous years, HACK:BLK asked teams to innovate within a specific category or field such as cloud, blockchain, or sustainability. But in 2023, the directors took a different approach, choosing a broader category called “operational excellence” in the hopes of inspiring even more employees to get involved. The challenge? To come up with ways to be more productive and efficient.</p><p>“I was a little worried that not as many people would participate because the theme wasn’t as tech-driven,” Jing says. “But everyone loved it. In the end, operational excellence had the most project submissions ever. It was a great experience for me to see the excitement around a common vision and goal. The hackathon helps BlackRock advance its technology and have a stronger employee community.”</p><p>Sometimes, HACK:BLK ideas do lead to real products, with engineering leaders adding them to their roadmaps. Mimi recalls a past hackathon project that was implemented within Aladdin. “The team didn’t come in first place, but their idea is now a real business,” she says. “A lot of times, it’s not a flashy project you see in the headlines.”</p><p>Take Smart Sensei. According to Mael, “Our use case caught the eyes of an AI team in London, and I’m going to be meeting with somebody about collaborating on it. Hopefully by the end of the year, we’ll have some beta users testing it.”</p><p>Before participating in HACK:BLK, Mael never imagined he’d have the chance to pursue his idea. “I was surprised at first that my manager pushed me to go and do it,” he says.</p><p>And now, he knows better. “There is a mindset at BlackRock that you should be like a student — always learning and improving,” he says. 
“The hackathon is quite representative of the company, and how we always try to find new ways to do things.”</p><p>A version of this blog was originally published on <a href="https://www.themuse.com/advice/blackrock-hackathon-innovation">The Muse</a> &amp; the <a href="https://careers.blackrock.com/2023/11/13/category1/how-blackrocks-hackathon-empowers-employees-to-be-innovative/">BlackRock Careers site</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8ba3fb85a848" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/how-blackrocks-hackathon-empowers-employees-to-be-innovative-8ba3fb85a848">How BlackRock’s hackathon empowers employees to be innovative</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Similarity Learning, the art of identifying neighbors]]></title>
            <link>https://engineering.blackrock.com/similarity-learning-the-art-of-identifying-neighbors-a74c975e6afc?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/a74c975e6afc</guid>
            <category><![CDATA[similarity-learning]]></category>
            <category><![CDATA[quant]]></category>
            <category><![CDATA[data-scientist]]></category>
            <category><![CDATA[similarity]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Thu, 16 Nov 2023 17:44:28 GMT</pubDate>
            <atom:updated>2023-12-04T23:35:17.273Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*XZKja5gbUxyEGe0CGdmfKQ.jpeg" /></figure><h4>BlackRock data scientists discuss how they think about similarity and how it advances risk and portfolio management processes</h4><blockquote><em>By: </em>Members of the<strong> Aladdin Financial Engineering (AFE) </strong>team at BlackRock</blockquote><blockquote><strong><em>Stefano Pasquali</em></strong><em>, Managing Director &amp; Head of AFE Investment AI Modeling &amp; Research; </em><strong><em>Philip Sommer</em></strong><em>, Senior Director of Product, Head of AFE Liquidity and Trading Analytics; </em><strong><em>Dhagash Mehta</em></strong><em>, Senior Principal Data Scientist, who leads Core AI Research within AFE; and </em><strong><em>Dhruv Desai</em></strong><em>, Data Scientist III in AFE Investment AI Modeling &amp; Research</em></blockquote><p>What is Similarity Learning? According to search engines, similarity is a relation between objects that expresses how much those objects are alike. It has many synonyms, such as resemblance, comparability, likeness, and correspondence.</p><p>In our day-to-day routine we frequently leverage the concept of similarity. For example, while searching for a product on an e-commerce site, we see products that are relevant (“similar”) to our target search. This idea extends easily to other use cases, like finding a house to buy, recommending friends on a social network, or identifying top results on a search engine. In each of these examples we are trying to identify neighbors for a given object, group of objects, set of filters, or query.</p><p>At BlackRock, within the Aladdin Financial Engineering team, data scientists are a diverse group of individuals who each bring unique skills and experiences from various quantitative fields like computer science, mathematics, theoretical physics, optimization, trading, and investment research. Collectively, our team is responsible for delivering analytical solutions from an applied machine learning point of view. Drawing on these diverse backgrounds, we have analyzed multiple aspects of trading and liquidity for bonds, equities, and other asset classes, including mutual funds and ETFs, from a similarity learning point of view. We settled on framing similarity as a mathematical concept: similarity measures how far apart or close together two objects are in some chosen space of variables. The task of correctly quantifying similarity then becomes one of correctly quantifying the distance between objects in that space.</p><p>Similarity has many applications in the world of finance, as well. For example, identifying “liquid substitutes” in <a href="https://www.pm-research.com/content/iijinvest/32/1/104">trading and investment processes</a> helps to efficiently source liquidity, significantly improve fill rates, and reduce transaction costs, while also shortening the negotiation cycles between traders and portfolio managers. This is a crucial application of bond similarity. Another is discovering fair prices for illiquid securities with little to no observable data on the target bond; here we must rely on the price movements of “similar” securities. In general, similarity between financial objects can help identify alternative, and possibly more efficient, investments at every level of granularity.</p><p>Over many years of doing systematic research, our team of data scientists has shown that, instead of manually imposing a distance metric such as the Euclidean distance to determine the similarity between two objects, we can get significantly better results by incorporating more advanced techniques. We can use machine learning (ML) to learn the appropriate distance metric for the given dataset! More specifically, we have focused on tree-based methods to learn similarity in a supervised fashion.</p><p>These methods use decision trees, random forests, and gradient boosting trees to learn the distance metric. A method using random forests, for example, was published by our team in this <a href="https://arxiv.org/pdf/2207.04368">article</a> and has since been extended to other tree-based methods with appropriate algorithm-specific adjustments.</p>
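<p>To make this concrete, here is a rough, hypothetical sketch of one such tree-based similarity — a simplified cousin of the published approach, not the paper’s exact algorithm. After fitting a random forest, we can treat two objects as similar when many trees route them to the same leaf. The features, labels, and data below are purely illustrative.</p><pre>
# Hypothetical sketch: a random-forest-induced similarity.
# Two objects are "close" when many trees send them to the same leaf.
# Features, labels, and data are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))               # e.g., 8 instrument features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in supervised labels

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# apply() returns, for each sample, the index of the leaf it lands in
# for every tree: shape (n_samples, n_trees).
leaves = rf.apply(X)

# Similarity of sample 0 to all samples: the fraction of trees in which
# the pair shares a leaf. Distance is then 1 - similarity.
sims = np.mean(leaves == leaves[0], axis=1)
nearest = np.argsort(-sims)[1:6]            # five closest neighbors of sample 0
print(nearest)
</pre><p>The learned metric inherits the forest’s sense of which variables matter for the supervised task, which is exactly what a hand-picked Euclidean distance cannot provide.</p>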
<p>We have been able to develop advanced trading analytics using similarity learning, as well as other ML and quantitative techniques, so that portfolio managers can add considerable value to their clients’ portfolios across both index and active strategies. This directly improves trading processes by replacing the heuristics-based sector categories and risk factors broadly used in portfolio management with dynamic, data-driven cohorts of similar instruments, which in turn helps identify liquid substitutes, improving order fills and order matching. See our related papers here: <a href="https://arxiv.org/abs/2207.04368">article-1</a>, <a href="https://arxiv.org/abs/2207.04959">article-2</a>, <a href="https://arxiv.org/abs/2308.06882">article-3</a>, <a href="https://arxiv.org/abs/2308.08031">article-4</a>, <a href="https://arxiv.org/abs/2207.07183">article-5</a>, <a href="https://www.pm-research.com/content/iijinvest/32/1/104">article-6</a></p><p>These new advances in data science are providing interesting ways to advance risk and portfolio management processes, and this is just the beginning!</p><p>Learn more about <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a74c975e6afc" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/similarity-learning-the-art-of-identifying-neighbors-a74c975e6afc">Similarity Learning, the art of identifying neighbors</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cranking the Voltage on Spark: Achieve Peak Performance with Optimization]]></title>
            <link>https://engineering.blackrock.com/cranking-the-voltage-on-spark-achieve-peak-performance-with-optimization-24da87c44ae3?source=rss----e8ae0174b0d8---4</link>
            <guid isPermaLink="false">https://medium.com/p/24da87c44ae3</guid>
            <category><![CDATA[scale]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[apache]]></category>
            <dc:creator><![CDATA[BlackRockEngineering]]></dc:creator>
            <pubDate>Wed, 11 Oct 2023 22:00:46 GMT</pubDate>
            <atom:updated>2023-11-08T20:27:12.280Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yd8bh1eFUM7hTl6PP0NjFQ.jpeg" /></figure><h4>Soumyata and Mohi describe their use of the Apache Spark framework to generate sophisticated risk analytics for millions of portfolios in minutes. They dive deep into Spark optimizations and share some tricks to increase the efficiency of Spark applications.</h4><blockquote><em>By: </em><strong><em>Soumyata Binani</em></strong><em>, Engineer I, &amp; </em><strong><em>Mohi Mohi</em></strong><em>, Engineer I on the Aladdin Wealth Tech Distributed Compute team, BlackRock</em></blockquote><p>Whether you’re a seasoned Spark aficionado or just embarking on your journey, delving into the nuances of Spark optimization can make a notable difference in the pace, effectiveness, and scalability of your big data processing pipelines.</p><p>Here at BlackRock in <a href="https://www.blackrock.com/aladdin/products/aladdin-wealth">Aladdin Wealth</a>, we manage to generate sophisticated risk and portfolio analytics for millions of portfolios in less than 30 minutes. Curious to know how? Well, the secret lies in our use of the Apache Spark framework. So, let’s take a plunge into the nitty-gritty and discover some of the tricks up our sleeves that help us squeeze out the maximum juice of efficiency from our Spark applications!</p><h3>Taming the Power Duo: Driver and Executors</h3><p>Imagine your Spark application as a team, with two key players: the driver and the executors. The driver leads the game, dividing and assigning tasks to the executors. The executors (the dedicated task doers) work in parallel, utilizing CPU cores and memory effectively to complete the tasks together. Configuring this duo is as much an art as a science. While drivers are straightforward and can run on minimal hardware, determining the ideal executor configuration for a Spark job is a challenging task.</p><h4>“How should one choose between plump vs lean executors, i.e., a few large executors or many small executors?”</h4><p>For instance, suppose we are submitting a job to a cluster that has a total of 150 GB of memory and 20 cores: do we create five 30 GB four-core executors, or ten 15 GB two-core executors? While there is no one-size-fits-all answer, understanding the implications of each approach in terms of CPU and memory can guide the decision.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kDAPIZ_n03ocJG6Z9YL4PQ.png" /></figure><h4>Lean, Mean, Task-eating Machines?</h4><p>Even though using lean executors may seem advantageous in terms of resource utilization, it comes with two potential downsides:</p><ol><li>Running out of resources to compute a partition can lead to memory issues and disk spilling during shuffling or when dealing with unbalanced data.</li><li>Having an excess of executors might not use resources effectively because of communication overhead. Each executor brings some overhead with it, and if executors are too small, a big part of their memory is consumed by overhead instead of actual work. For instance, with lots of 1 GB executors, roughly 25 percent of each one’s space is taken up by overhead.</li></ol><h4>…or Big Data Hoggers?</h4><p>Employing plump executors can also be wasteful. Placing very large executors on nodes (VMs) may result in unused space (memory or cores) on them. Additionally, it may cause delays in garbage collection and poor performance if there are too many cores per executor.</p>
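<p>To see what the two layouts look like in practice, here is a minimal, hypothetical configuration sketch; the application name is made up, and the numbers simply mirror the example above.</p><pre>
# Hypothetical sketch: two executor layouts for a 150 GB / 20-core cluster.
# The app name is made up; the numbers mirror the example in the text.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing-demo")
    # Option A: five "plump" executors (30 GB and 4 cores each)
    .config("spark.executor.instances", "5")
    .config("spark.executor.memory", "30g")
    .config("spark.executor.cores", "4")
    # Option B (alternative): ten "lean" executors instead
    # .config("spark.executor.instances", "10")
    # .config("spark.executor.memory", "15g")
    # .config("spark.executor.cores", "2")
    .getOrCreate()
)
</pre><p>The same settings can be passed to spark-submit via --conf flags; either layout fits the cluster’s budget, and the trade-offs above decide which one serves a given workload better.</p>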
<p>Ultimately, the optimal choice depends on factors like memory constraints, partition computation, communication overhead, and performance considerations, and it calls for a careful evaluation of the resources available when configuring an application on a Spark cluster.</p><h3>Better Distribution == Better Performance</h3><p>Spark takes a lazy approach when transforming an <a href="https://sparkbyexamples.com/spark/spark-rdd-vs-dataframe-vs-dataset/">RDD/Dataframe</a>, waiting until you trigger an action to compute the result. Each action is converted by the driver program into a job. A job consists of stages, which represent distinct data operations; some stages can run concurrently while others must wait for their dependencies. Stages comprise tasks that can be executed in parallel. To handle large data volumes efficiently, Spark divides the input dataset into smaller partitions. These partitions are then distributed across multiple executors, where tasks process their assigned data diligently.</p><h4>Divide and Conquer</h4><p>Effectively dividing your data into partitions allows your resources to conquer the workload! While Spark automatically breaks the data into partitions during loading, the repartition() method can be used to re-partition the data according to the specific requirements of your application. Beware: repartition() shuffles data between executors to create the desired number of partitions.</p><p>Suppose you have 20 cores and 100GB of memory at your disposal, and your task is to process 10,000 records. Now, let’s say you have configured four executors, each with five cores and 25GB of memory.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XfTxWZM00XyC9T5qKO6qJg.png" /></figure><p>In this example, the input file is split into 20 partitions, and since we have 20 cores available, all partitions are processed in parallel. Here we see a 1:1 mapping between partitions and cores, but typically the number of partitions should be a multiple of the total core count for optimal performance. If the executors encounter memory constraints, increasing the number of partitions can help spread out the memory footprint; for instance, we could have used 40, 80, or even 200 partitions if needed.</p><p>If you are certain that you must reduce the number of partitions, you can utilize a nifty little method called coalesce(), which reduces the partition count without shuffling data between executors (see the sketch below).</p>
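<p>A minimal, hypothetical sketch of both methods; the input path and the partition counts are illustrative, not a recommendation.</p><pre>
# Hypothetical sketch: controlling partition counts in PySpark.
# The input path and the numbers are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.read.parquet("/data/records.parquet")   # made-up path

print(df.rdd.getNumPartitions())   # whatever Spark chose at load time

# repartition() performs a full shuffle, but it can grow or shrink the
# partition count and rebalances the data evenly.
df40 = df.repartition(40)

# coalesce() only shrinks the count, merging co-located partitions so
# that no full shuffle is needed.
df5 = df40.coalesce(5)
</pre>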
<h4>Skewness — thief of our perfectly distributed world</h4><p>The previous example may have seemed overly idealistic because, in practice, data records vary in size and, consequently, in computation time. In Spark terminology, this phenomenon is known as “data skewness”, and it can result in delays and resource overload.</p><p>One way to check whether this is happening in your application is to inspect the stage summary metrics, which can be accessed by going to <strong><em>yarn-ui -&gt; Stages tab -&gt; Select a stage</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/902/1*P8tj7h5aA9POdnvgiUuNyw.png" /></figure><p>The above case, albeit an extreme one, illustrates the impact of skewed tasks. The entire stage is bottlenecked by the task that takes the longest, leaving other executors idle for an extended period while the largest partition finishes processing on a single executor.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K0Fn-xX9sPqLghEQKfo_Qw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9WlmizIgarKDkrvG9W-e7A.png" /></figure><p>The above graphs (explained in detail in a later section) show the heap utilization (and, consequently, the CPU utilization) of two executors during a run. One executor is constantly active while the other remains idle, waiting for the first executor to complete its processing.</p><p>One easy way to fix this situation is to use lean executors, which should mitigate the impact of the skewness. However, lean executors may not be applicable in all cases. In such situations, an alternative approach called “<a href="https://jasonrbriggs.com/journal/2022/12/17/salting-with-spark.html">salting</a>” can be employed. Salting breaks the largest partitions into smaller chunks, typically by appending a random component to the skewed key, thereby reducing data skewness (see the sketch below).</p>
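<p>A minimal, hypothetical sketch of salting in PySpark; the column names, toy data, and salt range are illustrative only.</p><pre>
# Hypothetical sketch: salting a skewed key before a wide aggregation.
# Column names, toy data, and the salt range are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

# Toy skewed data: one hot key ("A") dominates the dataset.
df = spark.createDataFrame(
    [("A", 1.0)] * 6 + [("B", 2.0), ("C", 3.0)],
    ["account_id", "notional"])

SALT_BUCKETS = 8   # more buckets spread the hot key more finely

# Append a random salt so the hot key's rows land in many partitions.
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# First aggregate on (key, salt): the heavy lifting is now spread out...
partial = salted.groupBy("account_id", "salt").agg(
    F.sum("notional").alias("partial_sum"))

# ...then combine the small partial results per original key.
totals = partial.groupBy("account_id").agg(
    F.sum("partial_sum").alias("notional_sum"))
</pre>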
<h3>Memory Tuning</h3><p>After determining the appropriate memory allocation to ensure that your executors do not run OOM, the next step is to focus on optimizing garbage collection (GC).</p><p>Before you start fine-tuning GC, you will first need to obtain the GC logs. To capture these logs, simply include the following in your executor’s JVM args when executing the spark-submit command:</p><p>-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps</p><p>The GC logs can be retrieved from the stdout of an executor, which can be found on <strong><em>yarn-ui -&gt; Executors tab</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UeENkYrk_kvwTVjpsu3T6g.png" /></figure><h4>Analyzing GC</h4><p>When analyzing GC results, there are two key aspects that require attention:</p><ul><li><strong>Throughput:</strong> A higher percentage indicates that your CPU cores are spending more time on actual computation than on collecting garbage. It is desirable to aim for a percentage between 95% and 98%. Additionally, you can explore the graphical and tabular data provided by your GC log analysis tool to assess the distribution of GC pauses (both major and minor).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5K6q21xzwPmiXCYGgWEZOw.png" /></figure><ul><li><strong>Heap Utilization:</strong> Examining heap utilization is essential, as we typically end up over-provisioning the heap during the early setup stage. To determine the ideal heap size, run your application with the highest expected load and retrieve the GC logs from the executor with the highest write size (if the application does not write anything, select the executor with the longest task time). Use this to identify the peak heap usage, then add a 10–20% buffer on top of this value to get the recommended heap size. Keep in mind that Spark adds a memory overhead (10% of executor memory by default), so consider this when selecting the executor memory size.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aFUEXioBL5jYrgWQsFSUMQ.png" /></figure><h3>Handling Data Exchange</h3><p>You might raise a question, <em>“What if a task running on one executor needs to access data from a partition that lives on another executor?”</em></p><p>To support this, Spark internally uses a process known as “shuffling”, which involves redistributing and exchanging data between partitions, requiring data to be transferred over the network. Shuffling can be resource-intensive and time-consuming, so minimizing it is essential for optimizing your Spark jobs. Techniques like data partitioning, broadcast variables, and efficient operations like map-side aggregations can help reduce shuffling.</p><h4>Sharing != Caring</h4><p>Contrary to the conventional belief that ‘sharing is caring’, in the Spark world it proves more advantageous for each executor to prioritize self-interest and avoid sharing data with others. This stems from the fact that sharing data over the network introduces unwanted delays. Let’s explore some powerful techniques that minimize data shuffling and turbocharge our I/O operations (a combined sketch follows this list):</p><ul><li><strong>Column Pruning: </strong>Imagine picking juicy fruits from a tree. You wouldn’t waste your time plucking unripe or unnecessary ones, right? Similarly, in Spark, selecting only the vital columns early in your dataframe transformations saves us from processing needless data. Additionally, we can optimize further by applying filters or predicates as early as possible in our transformations. <br>Let’s say you have a DataFrame called ‘employeeData’ with columns id, name, age, department, and salary, and you want to perform some transformations and calculations, but you’re only interested in the name and salary columns.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*Y9loA9lRdiTYdsl5zGWS3A.png" /></figure><ul><li>In this example, the select() method is used on the ‘employeeData’ DataFrame to keep only the name and salary columns. This operation prunes the unnecessary columns and returns a new DataFrame called ‘prunedData’. A filter is then applied to extract names starting with “a”. This targeted approach reduces the data size and improves processing speed, resulting in better application performance.</li><li><strong>Broadcast Variables:</strong> Picture this: instead of sending individual messages to every worker, why not deliver a message once to all of them simultaneously? Broadcasting variables does exactly that! By sharing essential variables with all workers in one go, you eliminate redundant data transfers. This means less time wasted shuffling data and more time spent on actual computation. However, be careful when broadcasting large and complex variables! They are stored in each executor’s memory in serialized form, and their deserialized heap representations are not shared across tasks; the deserialization cost is therefore paid on every task.</li><li><strong>Caching and Persistence:</strong> Ever wished you could avoid doing the same task over and over again? Caching intermediate results or frequently accessed datasets in memory or on disk does just that. It’s like having a shortcut that skips redundant computation and I/O operations.<br>Caching data in memory creates fast-access storage for Spark, allowing swift reuse of prior computations and ultimately saving time and resources. Memory caching is particularly effective when your datasets fit comfortably within the available memory space, ensuring rapid access without the need for costly disk operations.<br>Alternatively, when datasets are too big for memory, caching to disk is beneficial. While disk caching might not provide the same lightning-fast access as memory caching, it is still significantly faster than recalculating the data from scratch. It involves writing the data to disk in a serialized format, which reduces the storage overhead compared to caching the entire object in memory.</li></ul>
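<p>A minimal, hypothetical sketch tying the three techniques together; ‘employeeData’ mirrors the example above, while the toy rows, the lookup table, and the storage level are illustrative only.</p><pre>
# Hypothetical sketch: column pruning, a broadcast join, and caching.
# Toy rows, the lookup table, and the storage level are illustrative.
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-savers").getOrCreate()

employeeData = spark.createDataFrame(
    [(1, "alice", 34, "eng", 90.0), (2, "bob", 29, "ops", 70.0)],
    ["id", "name", "age", "department", "salary"])

# Column pruning + early filtering: keep only what downstream needs.
prunedData = (employeeData
              .select("name", "salary")
              .filter(F.col("name").startswith("a")))

# Broadcast join: ship a small lookup table to every executor once,
# instead of shuffling the large table across the network.
deptBudgets = spark.createDataFrame(
    [("eng", 1.0), ("ops", 0.5)], ["department", "budget"])
joined = employeeData.join(F.broadcast(deptBudgets), "department")

# Caching: reuse prunedData across several actions without recompute,
# spilling to disk if it outgrows memory.
prunedData.persist(StorageLevel.MEMORY_AND_DISK)
prunedData.count()   # the first action materializes the cache
</pre>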
<h3>Code Optimization</h3><p>While adjusting Spark parameters plays a vital role in optimizing performance metrics, it is equally important to focus on code optimization. Relying solely on parameter tweaks can only take us so far. Code optimization revolves around minimizing unnecessary computations, reducing data transfers, and maximizing resource utilization. Streamlining our code empowers us to achieve significant efficiency gains, enabling us to push the boundaries of performance even further.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*ts0rNeMcrO61RUkasuH-jw.png" /></figure><p>Despite Spark’s inherent ability to optimize code and generate an optimized execution plan, comprehending the intricacies of your code can still present a challenge. However, you can gain valuable insights into the code flow by examining the Directed Acyclic Graph (DAG) of stages/tasks displayed on the YARN UI. Note that DAGs of dataframes are usually more complicated, as Spark utilizes the Catalyst Optimizer to further optimize its execution plan; this can be analyzed in detail using the df.explain() method.</p><h4>Method Matters</h4><p>It is common for us to fall into the trap of utilizing the first method that solves our problem, often neglecting to consider how these methods provide the desired functionality. For instance (see the sketch after this list):</p><ul><li><strong>flatMap() vs map() + flatten():</strong> While both approaches yield the same result, they differ significantly in their execution. flatMap() avoids the creation of intermediate collections for each element, leading to reduced memory usage and potentially faster execution, especially when dealing with large datasets. However, this approach limits the flexibility to perform operations between the mapping and flattening steps.</li><li><strong>reduceByKey() vs groupByKey():</strong> reduceByKey() combines values locally on each partition before shuffling (a map-side aggregation), so far less data crosses the network than with groupByKey(), which ships every record of a key-value dataset to its destination partition.</li><li><strong>mapPartitions() vs map():</strong> Both provide transformation capability but differ in one key aspect: mapPartitions() runs your transformation once per partition, while map() runs it on each row. This is helpful when heavy objects must be initialized during the transformation: they can be created once per partition instead of once per record.</li><li><strong>take() vs collect():</strong> Calling collect() is an easy way to run out of memory on your driver, since it pulls the entire dataset onto it. If the full result is not needed, always resort to take(), which returns only the first ‘n’ records it encounters.</li></ul>
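<p>A minimal, hypothetical RDD sketch contrasting two of these pairs; the data is illustrative only.</p><pre>
# Hypothetical sketch: shuffle-light vs shuffle-heavy aggregation, and
# bounded vs unbounded result collection. Data is illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("method-matters").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("spark", 1), ("yarn", 1), ("spark", 1)] * 1000)

# reduceByKey: partial sums are computed on each partition first, so
# only one record per key per partition crosses the network.
counts = pairs.reduceByKey(lambda a, b: a + b)

# groupByKey would instead shuffle every single (key, 1) record:
# counts = pairs.groupByKey().mapValues(sum)

print(counts.take(2))      # bounded: fetches only the first 2 results
# print(counts.collect())  # unbounded: pulls everything to the driver
</pre>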
<p>By considering and exploring various methods, we can make informed choices that optimize memory usage, execution speed, and overall performance.</p><h3>Wrap Up!</h3><p>From understanding the quirks of your data to utilizing Spark’s awesome features, it’s all about maximizing performance and minimizing bottlenecks. Think of it as being the pit crew for your data race car, constantly monitoring and tweaking to reach victory lane. So, gear up and plunge into the realm of Spark optimization, where you can transform your sluggish pickup truck into a Formula 1 racecar!</p><p>Learn more about <a href="https://careers.blackrock.com/life-at-blackrock-2/technology/">technology careers at BlackRock</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=24da87c44ae3" width="1" height="1" alt=""><hr><p><a href="https://engineering.blackrock.com/cranking-the-voltage-on-spark-achieve-peak-performance-with-optimization-24da87c44ae3">Cranking the Voltage on Spark: Achieve Peak Performance with Optimization</a> was originally published in <a href="https://engineering.blackrock.com">BlackRock Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>