<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Drew Seewald on Medium]]></title>
        <description><![CDATA[Stories by Drew Seewald on Medium]]></description>
        <link>https://medium.com/@realdrewdata?source=rss-dff5f2854781------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*QVYjh50XJuOLQBeH_RZoGw.jpeg</url>
            <title>Stories by Drew Seewald on Medium</title>
            <link>https://medium.com/@realdrewdata?source=rss-dff5f2854781------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 17:28:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@realdrewdata/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Clean Maps in Python with Geopandas]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/clean-maps-in-python-with-geopandas-da602746c166?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1491/1*ETDp7ovu3i6f1CKpHs_sAA.png" width="1491"></a></p><p class="medium-feed-snippet">Tricks for easy to read, production quality maps</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/clean-maps-in-python-with-geopandas-da602746c166?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/clean-maps-in-python-with-geopandas-da602746c166?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/da602746c166</guid>
            <category><![CDATA[matplotlib]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[choropleth-map]]></category>
            <category><![CDATA[geopandas]]></category>
            <category><![CDATA[gis]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Wed, 26 Jun 2024 12:02:38 GMT</pubDate>
            <atom:updated>2024-06-26T12:02:38.541Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[ICYMI — March 2024]]></title>
            <link>https://realdrewdata.medium.com/icymi-march-2024-872d4bca1fde?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/872d4bca1fde</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Tue, 02 Apr 2024 11:01:59 GMT</pubDate>
            <atom:updated>2024-04-02T11:01:59.011Z</atom:updated>
            <content:encoded><![CDATA[<h4>ICYMI | March 2024</h4><h3>ICYMI — March 2024</h3><h4>Recap of my Data and AI posts from the past month</h4><p>Let’s wrap up the month with a roundup of what’s caught my eye in the data, AI, and machine learning space, and unravel what’s been on my mind.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iIeWL1YL29AWId14" /><figcaption>Photo by <a href="https://unsplash.com/@schluditsch?utm_source=medium&amp;utm_medium=referral">Daniel Schludi</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>Python is always the best, except when it’s not</h4><p>Python is always faster! Saving those 5ms of processing time isn’t worth no one on my team knowing how the code works.</p><p>Polars is always faster! My data isn’t big enough for the speedup to make sense for me to learn a new package.</p><p>Scala is always faster! Maybe Scala goes through fewer conversions to reach the actual execution plan, but Spark SQL converts to the same execution plan.</p><p>Mac is always faster! Well, I can’t for the life of me figure out how to do simple tasks on one, so it’s always going to be slower for me.</p><p>Give it a rest; not everyone needs to use the same tools. The ones you have and know how to use can still get the job done just fine.</p><h4>Variance in AI RAG Model Performance</h4><p>3% accuracy on LLaMA-2–7B on benchmark questions?!</p><p>It sounds unreal, but how you provide few-shot examples during in-context learning can lead to huge spreads in answer accuracy. As shown in a recent paper from the University of Washington, formatting details as simple as spaces, new lines, and colons can lead to accuracy spreads of up to 76 points (and scores as low as the 3% I mentioned). 
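To make the formatting sensitivity concrete, here is a toy Python sketch of two few-shot prompts with identical content but different prefixes and separators (the templates are illustrative only, not the paper’s exact formats):

```python
# Two few-shot prompts whose examples are identical; only the formatting
# (prefixes and separators) differs. Per the formatspread findings, trivial
# differences like these can move benchmark accuracy by tens of points.
# The templates below are illustrative, not the paper's exact formats.
examples = [("What is 2+2?", "4"), ("What is 3+3?", "6")]

def build_prompt(pairs, q_prefix, a_prefix, sep):
    """Render question/answer pairs into one few-shot prompt string."""
    return "\n".join(f"{q_prefix}{q}{sep}{a_prefix}{a}" for q, a in pairs)

prompt_a = build_prompt(examples, "Q: ", "A: ", "\n")
prompt_b = build_prompt(examples, "Question:", "Answer:", " ")

# Same content, different surface form; a model may score very differently
# on the two.
print(prompt_a)
print(prompt_b)
```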
These prompts look the same to you and me, but they can be the difference between a model answering a question correctly and failing miserably.</p><p>What’s even more interesting is that this sensitivity to few-shot example formatting shows up in every LLM, from GPT-3.5 to LLaMA-2. Even worse? The sensitivity can still be observed when increasing model size, increasing the number of few-shot examples, or doing instruction tuning.</p><p>There isn’t a format for few-shot examples that is universally good, either. The researchers showed that a format performing well on one model didn’t mean much in terms of how it would perform with another model. Could we game this, though? Maybe; if you know what format your model tends to perform consistently on, you will likely see good results with that model.</p><p>What should we do now that we know this? Well, the researchers suggest that instead of reporting results on the common benchmarks as a single accuracy score, ranges could be presented instead. Who cares if the next GPT-5 has an even better accuracy score on the benchmarks if you aren’t formatting your few-shot examples the same way? How is that a fair comparison to other models? Maybe this is something <a href="https://www.linkedin.com/company/huggingface/">Hugging Face</a> could find a way to integrate into their leaderboards…</p><p>You can check out the paper here: <a href="https://arxiv.org/abs/2310.11324">https://arxiv.org/abs/2310.11324</a></p><p>The code is also released on GitHub: <a href="https://github.com/msclar/formatspread">https://github.com/msclar/formatspread</a></p><h4>Developers are really good translators</h4><p>They take vague requirements from stakeholders and convert them to work in the strict, structured ways that programming requires.</p><p>Some tasks are super easy and well defined. 
Whether you want to find an exhaustive list of valid next chess moves or load a file using Python, both tasks have very clear acceptance criteria and don’t require many steps to complete correctly. These are ideal tasks to have AI help you with, but large language models (LLMs) aren’t up to the task of replacing developers today or anytime soon. Real software is so much more complicated.</p><p>I spend a surprisingly small amount of time as a data scientist actually coding and doing heads-down work. In my field, almost every request is unclear and undefined, so much of the work is figuring out how to define metrics, where to find data, and what the problem being solved actually looks like when it’s done. Sometimes it even means you have to go back to your stakeholder and let them know their request isn’t going to be possible to complete.</p><p>Most problems are poorly defined, and AI models are pretty bad at knowing what they don’t know. I don’t care how cool the <a href="http://devin.ai">devin.ai</a> demos look; AI isn’t going to steal your job anytime soon.</p><p>What can help you stay competitive today? Use the tools at your disposal where it makes sense to do so. Need to code a feature in your project? There is a whole range of models that can help you get that done quickly.</p><p>Check out the Stack Overflow blog post that inspired this section. I really enjoy their content, and it always gets me thinking about what the next big thing might be.</p><h4>Microsoft AdaptivePaste</h4><p>AI is great at generating sample code. I love being able to type plain language into GPT-4 and get back a nice code block that I can drop right into my Python notebook. 
But usually I’ll get a message after the code block telling me to “replace col1, col2, category, number with the actual column names in your DataFrame.”</p><p>Well, it turns out some researchers at Microsoft came up with a way to automatically identify and replace the variables in copied code (like from ChatGPT or Stack Overflow) with the correct variables already in your code.</p><p>The researchers deployed their method as a plugin that was surprisingly good at identifying and replacing variables in copied code. AdaptivePaste can be trained to adapt source code with 79.8% accuracy! Even more importantly, AdaptivePaste saved nearly 4 minutes compared to human developers on some tasks.</p><p>Microsoft blog on AdaptivePaste: <a href="https://www.microsoft.com/en-us/research/blog/microsoft-at-esec-fse-2023-ai-techniques-for-a-streamlined-coding-workflow/">https://www.microsoft.com/en-us/research/blog/microsoft-at-esec-fse-2023-ai-techniques-for-a-streamlined-coding-workflow/</a></p><p>Check out the pre-publication paper here: <a href="https://arxiv.org/abs/2205.11023">https://arxiv.org/abs/2205.11023</a></p><h4>Daylight Saving Time</h4><p>Unfortunately, March still includes a time change for some of us in the form of daylight saving time. 
Daylight saving time also happens to be the bane of my existence.</p><p>Times and dates feel inconsequential 99% of the time, yet no other small thing has such a monumental impact on my programming and data processes.</p><p>So, friendly reminder: if you have to handle daylight saving time this weekend, double-check your code before you push to production; otherwise you might be springing forward even earlier than you hoped.</p><p>My favorite resources for handling dates and times:</p><ol><li>Python datetime documentation: <a href="https://docs.python.org/3/library/datetime.html">https://docs.python.org/3/library/datetime.html</a></li><li>Handling datetimes in R with lubridate: <a href="https://rstudio.github.io/cheatsheets/html/lubridate.html">https://rstudio.github.io/cheatsheets/html/lubridate.html</a></li></ol><p>As a bonus, I also gathered some of my thoughts on handling the basics of dates and times in Python in the article below.</p><p><a href="https://realdrewdata.medium.com/working-with-dates-in-python-3df6062ed52c">Working With Dates in Python</a></p><h4>Should you Code “Easy” Tasks?</h4><p>If a coding task is easy, does that mean you shouldn’t create a method or attribute for it?</p><p>I saw a GitHub issue for Python where the maintainers politely said no to adding a very specific date format to the datetime module. It felt like a bit of a niche format, so I can see why they didn’t want to add it to the standard library.</p><p>At the end of the issue, the maintainer mentioned that they might consider adding an attribute to help with the request, but since the code to get the answer was so simple, it might make sense to do it yourself.</p><p>Meanwhile, lubridate, an R package for handling dates and times, has so many functions for getting unique components of datetimes. 
Looking at the code, sometimes the functions are as simple as a comparison on the hour to see if the time is in the a.m. or p.m.</p><p>It’s two different development philosophies, but I tend to lean towards having more methods and attributes so I never have to remember or figure out the calculation again each time I need it.</p><h4>Spaces or Tabs while Coding?</h4><p>Spaces or tabs when coding?</p><p>It isn’t something I see come up too often, especially in the Python world, but it does come up from time to time. The argument is very similar to Databricks vs dbt or pandas vs polars. The decision isn’t always up to you, and the benefits aren’t the sole driver of the decision.</p><p>Take whatever language you work in. Does it support using both? Only one? Then the decision might be made for you.</p><p>What about your team? Do they use one or the other? What does the style guide say? Do you even have a preferred style guide on your team for writing code? You can’t force tabs if everyone on your team is into spaces.</p><p>One common argument is that tabs don’t always show up the same across different systems, while 4 spaces render the same distance all the time. Tabs are a single character, saving space that might be precious in your application. But if the team has agreed to use spaces, these still might not be good enough reasons to switch.</p><p>Change is hard, and even if there could be benefits to it, it doesn’t always make sense. 
There’s no need to argue and spend time on how one is infinitely better than the other if the team is set in one way of doing things and it truly doesn’t matter that much at the end of the day.</p><h4>Getting model results to be used is the hardest part of Data Science</h4><p>Data projects aren’t done when the model is built and tuned.</p><p>We talk a lot about how difficult it is to get quality data and clean it up to use in models, and how that part of a data project can take 80% of the total time. But have you ever tried getting people to adopt your model and use it in their processes?</p><p>You can build the best model ever created that runs in a fraction of a second and costs nothing to run, but if you can’t get it deployed or incorporated into the business user’s processes, you haven’t finished the project.</p><p>This adoption of the model is much easier if you start working with your stakeholders early on and they are invested in the improvement your model provides. They need to agree with how you deploy it and the value the model creates for them so that they can bring it to their team and make an actual difference.</p><h4>AI watermarking</h4><p>Watermarking AI content is a key step to fighting deepfakes and misinformation.</p><p>Hugging Face published a blog post last month detailing different methods for watermarking images, text, and audio content. It couldn’t have come at a better time, as it’s harder than ever to know if what you’re looking at is real. The proliferation of tools to create AI-generated content makes watermarking a necessary part of the AI toolkit.</p><p>These tools provide a way to prevent data from being used to train more AI models, help identify AI-created content, and help document the provenance of digital media.</p><p>Some techniques are as simple as an indicator in an image. A few extra bits that can be read to say “I’m AI generated!” Others embed metadata about the image. 
Still others modify the image so that it still looks normal to a human, but AI algorithms have trouble reading it properly.</p><p>Watermarking isn’t always easy, though. While there are methods for detecting AI-generated text, they are so inconsistent that OpenAI shut down their tool for detecting ChatGPT output because of low accuracy.</p><p>Check out the post to learn more about the details of watermarking AI content: <a href="https://huggingface.co/blog/watermarking">https://huggingface.co/blog/watermarking</a></p><h4>Knowing the right tool for the job is critical for success</h4><p>I recently saw a post from a Python developer who had a client send them data, but the data was text in images. So they did the first thing that comes to every Python dev’s mind and used OCR (Tesseract for text extraction) to get the data out easily.</p><p>They coded up the solution, did a bit of testing and tweaking to get it right, and eventually got the text out of the images without having to do too much manual clean-up after.</p><p>So what’s the issue? Microsoft PowerToys includes a Text Extractor utility: press Win + Shift + T and you can select and extract any text on your screen. It works extremely well and is one of the tools in my arsenal for tackling problems. It’s incredibly easy to use, but if you didn’t know it existed, you never would have known there was such an easy way to do it.</p><p>The free version is available as part of Microsoft PowerToys: <a href="https://learn.microsoft.com/en-us/windows/powertoys/text-extractor">https://learn.microsoft.com/en-us/windows/powertoys/text-extractor</a></p><h4>Test yourself!</h4><p>I think we can all agree that courses, certifications, and even college degrees aren’t the only way to demonstrate your skills to employers. Projects, blogs, newsletters, posts, and videos are all great ways to show off what you’ve been working on and how you applied your skills.</p><p>…But that doesn’t mean that every certification is meaningless. 
In the data world, I put a bit more stock in certifications that expire eventually. Technology and best practices evolve quickly, so having to demonstrate the latest skills every once in a while makes sense. It does come at the cost of, well, paying a large corporation each time you test.</p><p>Speaking of testing and certifications, I just earned the Databricks Generative AI Fundamentals badge. Databricks does offer a compelling solution if your company is looking to build AI use cases, so it’s worth comparing to other offerings out there.</p><h4>AI really changes the build vs buy conversation</h4><p>A few years ago I would have gone with an off-the-shelf solution almost every time. There are so many options out there, and many vendors are very open to feedback or will work with you to add features. In so many cases, there is someone out there offering exactly what you need.</p><p>With the incredible advancements in large language models over the past year, I can see the build vs. buy conversation being very different. There are so many options for developer augmentation tools through copilots and chat agents.</p><p>Midjourney is a great example of how much can be accomplished with a small team. They are one of the top AI image generation platforms out there, but in October 2023 they reportedly had fewer than 100 full-time employees. A small team with the right tools can do incredible things.</p><h4>Matryoshka Embedding: Not your Grandma’s Word Embedding</h4><p>Russian nesting doll word embeddings?</p><p>I was checking out the Hugging Face blog when I saw a post about Matryoshka, or Russian Nesting Doll, embeddings. At a super high level, they are regular embeddings that have been truncated to a smaller number of dimensions. 
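The retrieval-side idea can be sketched in a few lines (an illustration with random vectors standing in for real embeddings, not the actual MRL code): keep only the first k dimensions of each embedding, then re-normalize so cosine similarity still behaves:

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions of each row, then L2-normalize so a
    plain dot product is still a cosine similarity."""
    truncated = embeddings[:, :k]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 768))       # stand-in for 768-dim embeddings
small = truncate_and_normalize(full, 64)  # a 12x smaller index to search
print(small.shape)                        # (1000, 64)
```

With a Matryoshka-trained model, the front of the vector carries most of the signal, so the shortened index loses far less accuracy than this naive truncation would on ordinary embeddings.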
Depending on how much of the original embedding is kept, storage shrinks and retrieval speeds up dramatically, even on large-scale, real-world data sets.</p><p>The best part is you don’t have to sacrifice much accuracy. The secret sauce is how the optimizer is configured during training: the training process computes a loss at each of several embedding sizes, and the losses are summed and optimized together. This incentivizes the model to put the most important parts of the embedding at the front of the vector representation, meaning the truncated version retains more information.</p><p>Check out the blog post (and play around with the embeddings near the end): <a href="https://huggingface.co/blog/matryoshka">https://huggingface.co/blog/matryoshka</a></p><p>Check out the paper: <a href="https://arxiv.org/abs/2205.13147">https://arxiv.org/abs/2205.13147</a></p><p>Get the pretrained models: <a href="https://github.com/RAIVNLab/MRL">https://github.com/RAIVNLab/MRL</a></p><h4>AI isn’t the end-all of increasing productivity</h4><p>AI isn’t going to increase your productivity…<br> <br>…by itself. Other changes are required to maximize productivity.<br> <br>I was reading a blog post from Stack Overflow that had a lot of good points about where LLMs fit into the future of work and productivity. Right now, code generation, or codegen, tools rarely get programmers to a 100% finished product. They are excellent at generating code based on requirements and re-writing example code to apply to your specific use case, but the developer still needs to know what they want to accomplish to get the artificial intelligence to provide useful information.<br> <br>I recently spent some time with someone who has excellent subject matter expertise but doesn’t have a Spark coding background. They wanted to do some feature engineering but didn’t know how to write the Spark code in Python to get the task done. 
They turned to ChatGPT and were able to write functioning code that accomplished the task. They didn’t have to wait for me to help them write anything, and I got to spend my day working on something else. Productivity was up!<br> <br>AI codegen tools are not always perfect, though. They have a habit of confidently providing valid-sounding information that is actually false. You’ll hear this called hallucination. Someone who doesn’t have a lot of experience won’t be able to tell these from legitimate code, and even if it is harmless (which sometimes it isn’t!), they will struggle to correct any errors. Productivity is not so up.<br> <br>I’ve seen a lot of news about <a href="http://Devin.ai">Devin.ai</a>. A very cool concept, to be sure, but it goes to show how codegen tools still have a way to go to help developers write quality code effectively. If you haven’t seen it yet, it takes an input task from the user. It then has access to code, command line, browser, and other resources, just like a developer would, and uses them to attempt to solve the problem. It can solve problems and fix errors, but it isn’t that successful yet. Your job is safe for now, or at least until the coding robots evolve further.<br> <br>Some other applications for codegen that were touched on in the blog post:</p><ul><li>Code gen is great for writing unit tests…but you still need to know what to test for.</li><li>Code gen is great for documenting and explaining code…so you can understand what it does and how that will impact new features and enhancements.</li></ul><p>I highly recommend going to read the full article for yourself on the Stack Overflow blog: <a href="https://stackoverflow.blog/2023/10/16/is-ai-enough-to-increase-your-productivity/">https://stackoverflow.blog/2023/10/16/is-ai-enough-to-increase-your-productivity/</a><br> <br>Also check out the Stack Overflow Podcast. 
They are always talking about interesting things in this space and it’s short, sweet, and to the point: <a href="https://stackoverflow.blog/podcast">https://stackoverflow.blog/podcast</a></p><h4>How can you write code that explains itself?</h4><p>Could you write code that explains itself?</p><p>With programming, there are so many ways to reach the same outcome. Sometimes one way is subjectively better, but other times there is an obviously superior way to complete a task.</p><p>I’ve seen many people getting different elements of a Spark datetime using the substring function. It works, but what on earth does a substring that starts at the 1st character of a string and is 4 characters long mean?</p><p>I like to use functions that describe what they are doing. That way, the next time I look back and need to know if the feature I just created is the month or year of a datetime, the function tells me straight up which it is.</p><p>Check out the examples. The two result variables should get the same output, but only the second one uses functions purpose-built to explain themselves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*38J8KZoRDNP-rjTJfahjjg.jpeg" /><figcaption>Screenshot by the author</figcaption></figure><p>That’s all for March! If you’re looking for more consolidated content like this, be sure to follow me for a monthly download of what I’ve been looking at each month.</p><p><a href="https://realdrewdata.medium.com">Drew Seewald - Medium</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=872d4bca1fde" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Working With Dates in Python]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/working-with-dates-in-python-3df6062ed52c?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1024/1*XIvzDNaEpNNMZbNBuxmwEA.jpeg" width="1024"></a></p><p class="medium-feed-snippet">Daylight saving time and time zones are easy!</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/working-with-dates-in-python-3df6062ed52c?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/working-with-dates-in-python-3df6062ed52c?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/3df6062ed52c</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[timezone]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Mon, 11 Mar 2024 12:47:35 GMT</pubDate>
            <atom:updated>2024-03-11T12:47:35.998Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[ICYMI Data & AI — February 2023]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/icymi-data-ai-february-2023-cf14ec8f78a6?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/2600/0*1lVlzA6z7fnyGJ4E" width="4501"></a></p><p class="medium-feed-snippet">Recap of my favorite data and AI discoveries from the past month</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/icymi-data-ai-february-2023-cf14ec8f78a6?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/icymi-data-ai-february-2023-cf14ec8f78a6?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/cf14ec8f78a6</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Mon, 04 Mar 2024 13:47:34 GMT</pubDate>
            <atom:updated>2024-03-04T13:47:34.200Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Unleash the Power of AI on Your Own Data with NVIDIA Chat]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/unleash-the-power-of-ai-on-your-own-data-with-nvidia-chat-c32561cd7353?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1024/1*TKAz6EGVLiOZMsPERlIy6g.jpeg" width="1024"></a></p><p class="medium-feed-snippet">A simple way to get your hands dirty with LLMs using your your own data</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/unleash-the-power-of-ai-on-your-own-data-with-nvidia-chat-c32561cd7353?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/unleash-the-power-of-ai-on-your-own-data-with-nvidia-chat-c32561cd7353?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/c32561cd7353</guid>
            <category><![CDATA[open-source-ai]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[ai-tools]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Fri, 01 Mar 2024 13:47:36 GMT</pubDate>
            <atom:updated>2024-03-01T13:47:36.759Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[I Made A Python Geopandas Cheat Sheet]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/i-made-a-python-geopandas-cheat-sheet-dd0f4392300e?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1280/1*6J1a-KnxFQXKDUt3fy20qQ.png" width="1280"></a></p><p class="medium-feed-snippet">Making a Python Programming Cheat Sheet</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/i-made-a-python-geopandas-cheat-sheet-dd0f4392300e?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/i-made-a-python-geopandas-cheat-sheet-dd0f4392300e?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/dd0f4392300e</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[mapping]]></category>
            <category><![CDATA[cheatsheet]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Thu, 21 Sep 2023 13:16:31 GMT</pubDate>
            <atom:updated>2023-09-21T13:16:31.383Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[The False Promise of Virtual Environments]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/the-false-promise-of-virtual-environments-b1a6cd639a03?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1280/1*wPaM3hKaLzaCrQGaIpzrvQ.png" width="1280"></a></p><p class="medium-feed-snippet">Just pip install -r requirements.txt, right? Right&#x2026;?</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/the-false-promise-of-virtual-environments-b1a6cd639a03?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/the-false-promise-of-virtual-environments-b1a6cd639a03?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1a6cd639a03</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Thu, 19 Jan 2023 17:27:56 GMT</pubDate>
            <atom:updated>2023-01-19T17:27:56.030Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Upskilling in Public 2023]]></title>
            <link>https://realdrewdata.medium.com/upskilling-in-public-2023-e567a9ce7ca2?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/e567a9ce7ca2</guid>
            <category><![CDATA[goals]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[career-advice]]></category>
            <category><![CDATA[self-improvement]]></category>
            <category><![CDATA[learning]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Mon, 09 Jan 2023 13:02:21 GMT</pubDate>
            <atom:updated>2023-01-09T13:02:21.454Z</atom:updated>
            <content:encoded><![CDATA[<h4>Self Improvement | 2023 Personal Goals</h4><h4>Positioning myself to succeed this year and beyond!</h4><figure><img alt="2 Champagne flutes, half full, in front of a background of gold glitter and confetti. To the right, black fancy font says Happy New Year, new goals ahead." src="https://cdn-images-1.medium.com/max/1024/1*agP4knSw7EOH-FRY6FHe0g.png" /><figcaption>It’s a new year, let’s make it the best one yet!</figcaption></figure><h3>Intro</h3><p>With the new year, I’ve been thinking long and hard about what kinds of goals I want to set for myself. I came to the conclusion that I want to take some steps to set myself up for success in a career in machine learning. I asked myself how I could do this while staying engaged, learning new skills, and creating a portfolio of projects to show off the cool things machine learning can achieve. How can I continue to move the needle in my professional career while bringing exciting machine learning projects to the world? The answer is learning in public!</p><h3>Learning in public</h3><figure><img alt="Black text saying learning in public on a light purple background with a curving white line going through it. On the right there is a stack of books with ladders leaning on the sides. On top of the books is a black graduation cap and rolled up diploma paper." src="https://cdn-images-1.medium.com/max/1024/1*dE0o5asj1y6L9Bhr8kvReA.png" /></figure><p>What is learning in public? It’s something I stumbled across on LinkedIn a couple of years back. I saw people in my network uploading videos and stories of themselves working on coding problems and learning new skills. The idea baffled me when I first saw it. Who would want to show off all the things they don’t know in front of their entire online network?</p><p>It turns out, this is actually a great way to learn, help others, and demonstrate your problem-solving skills. 
It holds you accountable because there are people out there following up on the projects you’re working on. Sometimes they offer advice on how to fine-tune a section of code you had trouble with. They might point out something incredibly simple that makes the entire thing run 100x faster, helping you learn a very useful method for solving future problems.</p><p>How does showing off what you don’t know help others? Nearly everyone has some skill that they are just a bit better at than someone else. One of the most insightful things I’ve heard about teaching is that you don’t need to be the best at something to help someone else learn. As long as you’re learning how something works and distilling it into a simple format for someone else to learn from, you’re helping raise everyone’s skill levels.</p><blockquote>You don’t need to be the best at something to help someone else learn</blockquote><p>To a prospective employer, the benefits of learning in public are twofold. They get to see how you work through a problem, solving smaller problems along the way. How do you break down complex tasks into simpler ones? How do you handle unexpected issues while working on a coding task? They also get to see that you are willing to learn, take feedback, and improve yourself while working with others. The highest-performing workers aren’t worth much if they give up the second they come across something they don’t understand or meet someone they don’t see eye to eye with.</p><p>Learning in public is a way for me to stay up to date with cutting-edge technology like large language models and exciting visualization techniques. It’s a way for me to practice sharing what I’m learning with others and to show how exciting machine learning can be by applying it in ways people never would have thought about. I have some ideas for how to apply machine learning to create tools to help teachers and even create unique gifts for friends and family. 
Learning in public is how I will stay motivated and keep learning to make these projects a reality.</p><p>So learning in public has some serious potential upsides, but how do I plan to do it?</p><h4>Sharing My Code</h4><figure><img alt="Glowing white text saying sharing code. There are purple lines above and below the text. To the left, there is a code tree surrounded by a purple circle. The background of the image is dark with a hand on a mouse next to a gaming keyboard." src="https://cdn-images-1.medium.com/max/1024/1*Juybi_cBZikpOlyr9LAZrg.png" /></figure><p>First up, most of the projects I intend to tackle over the next year will require code. <strong>Lots</strong> of code. Where does code go? A version control platform like GitHub, of course!</p><p><a href="https://towardsdatascience.com/github-desktop-for-data-scientists-b9d8a3afc5ea">GitHub Desktop for Data Scientists</a></p><p>I will be updating my GitHub to be a friendly home for all my projects: something easy to navigate that doesn’t look too intimidating for someone who doesn’t necessarily want to dive into all the code. This will let my projects live somewhere organized. If any of my projects grow into something bigger than a one-time effort, others can collaborate on them there as well.</p><p>Another benefit of GitHub is being able to move my projects across platforms. I have two desktop machines that I like to do my work on, one running Ubuntu and the other running Windows 10. GitHub Desktop runs on both platforms and makes it really easy to keep code up to date, even if I’m switching platforms halfway through a project. I’m going with GitHub Desktop over a command line interface (CLI) because with a well-designed GUI it’s just easier to understand exactly what’s going on. CLIs have some benefits, but at the end of the day I don’t want to be struggling with a tool that isn’t even my end product. 
Saving and sharing should be a painless process, and a git CLI just isn’t the way for me to get there.</p><p>Working on a project that needs to run well on multiple machines set up very differently is good preparation for working in industry. I originally installed Ubuntu on my main computer to gain more experience working with that platform, but there are still some tools that are easier to work with on Windows. Experience working with both is worth having.</p><p>So how am I going to make my GitHub approachable?</p><h4>Notebooks!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0aLi-YjgtFXiUugEdqA7SQ.png" /></figure><p>As much as I hate to say it, a large portion of the code in my GitHub will probably be in Jupyter Notebooks. In the past I haven’t been too impressed with the standard features of Jupyter Notebooks. My preferred tool for development is a full-featured integrated development environment, or IDE, like Spyder that has debugging, variable inspection tools, and other useful features.</p><p>One of the strengths of Jupyter Notebooks is that they allow narrative text and code to live side by side, with beautiful code output and visualization included in the end product. This perfectly suits my needs: not only to do projects and write code, but to explain why I’m making the choices I am and the struggles I had along the way. Jupyter Notebooks are a great tool for telling learning stories.</p><p>Another option with Jupyter Notebooks that I have yet to truly explore is extensions. I’m used to a highly customized text editor like Atom (I know, I’ll switch to VS Code eventually) or an IDE like Spyder that has all my little creature comforts. Jupyter Notebook extensions can add a lot of functionality like code completion and desktop notifications. 
I just need to sit down and spend the time to find extensions that make me not miss my other editors.</p><p>With my notebook and code storage needs met, I need to figure out what to put in these notebooks.</p><h4>New Technology and Courses</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zGqPBWpXAGOy8-OF35oxpg.png" /></figure><p>To stay relevant in the ever-changing data and programming landscapes, I want to pursue some new skills. I want to learn things that are not only interesting, but also help build a foundation for other projects I want to work on. I pulled a job description for a data analytics engineer at Airbnb, some highly rated machine learning courses, and some cool visualizations I found on Twitter. These served as the inspiration for some of the skills I want to pursue learning this year and building projects with.</p><p>Here’s a preview of some of the skills I want to learn and courses I want to take in the coming year:</p><ul><li><a href="https://airflow.apache.org/">Apache Airflow</a> — Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. Since Airflow is used at some of the largest companies to solve some very large data problems, knowing how to use it felt like a must. Setting it up on a spare machine I have lying around at home will support my future projects by giving me an automation platform to build with, and also position me to work with it in a future job.</li><li><a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion</a> — Generative AI models started to have their moment last year, with Stable Diffusion being one of the coolest projects to me. Unlike competitors like <a href="https://en.wikipedia.org/wiki/Midjourney">Midjourney</a>, Stable Diffusion models are available to everyone free of charge. 
While I don’t consider myself much of an artist, AI art creation tools could be used in a pipeline to create some really cool projects this year.</li><li><a href="https://www.youtube.com/playlist?list=PLkDaE6sCZn6GMoA0wbpJLi3t34Gd8l0aK">Deeplearning.ai</a> — Their courses are highly rated, and the Machine Learning Engineering for Production (MLOps) Specialization in particular should give me better ideas for how to deploy projects and what the best practices are for my personal and professional projects.</li><li><a href="https://www.datacamp.com/tracks/machine-learning-scientist-with-python">Datacamp Courses</a> — While courses can’t teach you everything, they are a good way to get more exposure to new concepts. I want to make progress on Datacamp’s Machine Learning Scientist with Python career track this year to pick up some new skills in the space.</li></ul><p>Courses by themselves are just the foundation for the ultimate goal in my GitHub: projects.</p><h4>My Project Philosophy</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d2grlUnKxYDhuMnpMUs-mw.png" /></figure><p>Projects are where the real magic is going to happen this year. This is where I’m going to really demonstrate my abilities while bringing some cool concepts to life. I could just do Kaggle projects and take online boot camps all year, but that would fall far short of proving anything. A lot of Kaggle projects have been done to death (Titanic, housing values, etc.). Many of them provide easy data to work with, barely requiring any preparation before applying machine learning. No, the real magic is in original projects.</p><p>Original projects require much more real-world problem solving and work. They require defining your own requirements, gathering and combining data sources, building out pipelines, and testing different approaches to find something that works. 
The courses are the starting point to learn concepts, but my own projects are where I want to synthesize those skills into something bigger.</p><p>Which projects I choose is going to be determined by several factors. I want to do projects that force me to interact with people who don’t think in the programmatic way that I do. I want to work with people to help them understand how they can benefit from machine learning. Synthesizing my skills and the domain expertise of non-data experts has been a large part of my professional career, and that’s where I see the most value. I want to help people understand the crazy leaps in technology that we are seeing today and how they can be applied in new places.</p><h4>Sharing My Output</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WhCvTZD7mECWWBoqsevCzA.png" /></figure><p>Obviously none of this is possible without sharing with the world. That’s really the whole point of this upskilling and learning in public journey. My guides and project struggles are going to be shared on Medium like they have been. My hope is to launch some video content this year, but I don’t have concrete plans for that yet. Condensed versions of the Medium content will likely go up on Twitter and LinkedIn like normal, along with anything visually appealing that lends itself well to those platforms.</p><p>A big part of putting my content out there is the potential benefit others can get from it. Maybe they find it entertaining to see a wacky project I’ve completed, or they learn something from it. If I can produce content that entertains or gets people interested in the machine learning space, I’ll consider that a success.</p><h3>Conclusion</h3><p>I’ll be the first to admit, these are some big plans for 2023. 
I found a job description that I want to pursue somewhere down the line and tailored my courses and learning for the next year toward things that will put me in a good position to take on fun projects, projects that will also get me closer to a role in the machine learning field. Everything also overlaps with my personal interests in new and exciting machine learning projects, which should help keep me motivated to keep working at it for a long time.</p><p>All this being said, the best-laid plans require a lot of work and can still go awry. At the very least I have a vision for where I want to take my career long term and plenty of ideas for the steps to take to get there. Thanks for reading, and good luck with all of your plans for 2023 as well!</p><p><a href="https://realdrewdata.medium.com/membership">Join Medium with my referral link - Drew Seewald</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e567a9ce7ca2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[R Has So Many %/>* Operators!]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://realdrewdata.medium.com/r-has-so-many-operators-fa18d4166701?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/1280/1*9xS8616iD-nvvNarFJC7lg.png" width="1280"></a></p><p class="medium-feed-snippet">Relax, it&#x2019;s just some R operators stuck together</p><p class="medium-feed-link"><a href="https://realdrewdata.medium.com/r-has-so-many-operators-fa18d4166701?source=rss-dff5f2854781------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://realdrewdata.medium.com/r-has-so-many-operators-fa18d4166701?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/fa18d4166701</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Wed, 04 Jan 2023 13:01:54 GMT</pubDate>
            <atom:updated>2023-01-04T16:11:00.076Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[The 9 Best Atom Packages for Boosting Coding Productivity]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/geekculture/the-9-best-atom-packages-for-boosting-coding-productivity-f840804ee4eb?source=rss-dff5f2854781------2"><img src="https://cdn-images-1.medium.com/max/2600/0*65U9CLhrtPWcpAlG" width="5196"></a></p><p class="medium-feed-snippet">These packages make data analysis projects so much easier</p><p class="medium-feed-link"><a href="https://medium.com/geekculture/the-9-best-atom-packages-for-boosting-coding-productivity-f840804ee4eb?source=rss-dff5f2854781------2">Continue reading on Geek Culture »</a></p></div>]]></description>
            <link>https://medium.com/geekculture/the-9-best-atom-packages-for-boosting-coding-productivity-f840804ee4eb?source=rss-dff5f2854781------2</link>
            <guid isPermaLink="false">https://medium.com/p/f840804ee4eb</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[coding]]></category>
            <dc:creator><![CDATA[Drew Seewald]]></dc:creator>
            <pubDate>Thu, 24 Feb 2022 17:17:58 GMT</pubDate>
            <atom:updated>2022-02-25T06:22:25.540Z</atom:updated>
        </item>
    </channel>
</rss>