Stories by Yashika Sharma on Medium

Tech Certifications: Are they worth it?

Yashika Sharma — Mon, 15 Jul 2024 17:28:51 GMT

There’s always been a lot of debate about whether tech professionals need certifications and if they’re even worth it. You’ll often hear that hands-on experience is what really counts, and many recruiters might not even look at certifications on your resume. While that’s definitely true, I want to share a different perspective.

The Debate: Experience vs. Certifications

Photo by Kenny Eliason on Unsplash

First things first: Certifications are not meant to replace real-world experience or hands-on projects. They’re a good addition but not a necessity. So, why might it still be a good idea to get some certifications? Let me explain.

I’m a big advocate for learning by doing. I didn’t go to a fancy university; most of what I know comes from online courses, working on tons of projects, and my experience as an engineer with various companies over the past six years.

Certifications vs. Certification Exams

Photo by Scott Graham on Unsplash

There’s a difference between getting a certification just for finishing a course and passing a certification exam. Some courses give you a certificate for completing the modules and maybe a final project. For instance, when I started with machine learning and deep learning, I was all over the place with online documentation and YouTube videos. It was helpful, but I lacked structure.

That’s when I found the Deep Learning Nanodegree from Udacity. It offered a structured path with hands-on assignments and projects that really helped me grasp the material. This kind of certification is more about having a structure to follow and showing that you’ve completed a learning journey.

Certification exams are a different ball game. As a data engineer, I try to stay updated with new advancements. Some certification exams like the Azure Data Engineer Associate or Google Professional Data Engineer can be really beneficial if you are in this field. These exams are more recognised in the industry. While they might come with official learning modules, you can also pair them with YouTube tutorials or Udemy courses for a broader understanding.

When I was preparing for the Azure Data Engineer exam, I used my real-world experience with Data Engineering and Azure and supplemented it with hands-on exercises, which were part of the Data Engineering with Microsoft Azure course. It was fun to test my knowledge and pass the exam.

Why Bother with Certifications?

Photo by Emily Morter on Unsplash

Some might say, “Why not just follow a course or look up documentation as needed? Why specifically do you need to pass the exams?”

That’s a fair point, but realistically, how many people actually finish the courses they start? Additionally, finding the right documentation can be time-consuming if you’re not sure where to look.

I see certifications as a motivator — an end goal to keep you on track. You don’t need a certificate for every tool or technology, but if you’re really interested in learning something in-depth or switching to a new platform, it can be beneficial. For example, if you’re experienced with AWS and need to work with Azure, a certification exam can help you quickly get up to speed with Azure-specific concepts.

Things to Consider Before Getting a Certification

Time: Do you have the extra hours to study?
Relevance: Is the certification relevant to your role, field, or interests?
Cost: If you can’t afford it, look for financial aid (for example, Coursera offers this) or see if your employer has a professional development budget.
Practical Experience: Do you already have exposure to this area and want to take it to the next level, or will you have opportunities to apply what you learn?

Conclusion: Are Certifications Worth It?

Photo by Alexander Grey on Unsplash

In the end, whether or not to pursue certifications comes down to your personal goals and circumstances. They can be a great way to structure your learning and validate your skills, but they shouldn’t replace hands-on experience. Use them as a tool to complement your practical knowledge.

Think about whether a certification is worth it for you based on your time, interest, and goals. Certifications can enhance your knowledge and career if you approach them with the right mindset: pursue them for practical use, not just for the sake of credentials. Sometimes, you might even benefit from having them on your resume, especially if you have corresponding hands-on experience.

Feel free to reach out if you have any questions or if you have a different opinion. I’m always up for a discussion! 😊

Data Governance from your Terminal

Yashika Sharma — Thu, 02 Mar 2023 16:42:15 GMT

Data engineers, software developers, sysadmins. To the untrained eye, we’re all just computer folks.

But to us in the know, we’re far apart when it comes to performing our day jobs, with different challenges and different toolkits. That said, there is one thing that we all tend to know and love: our beautiful terminals. Doing things via command-line instead of dragging your mouse through an interface is the kind of thing that once you get used to, there is no turning back.

And the same applies to Alvin: once you get used to managing and consuming your metadata right from the terminal, well… let’s just say you’ll start asking yourself how you lived without it.

The power of Alvin’s metadata in your terminal? Let’s check out how it works!

Introducing: the Alvin CLI

With the Alvin CLI, you can use all the main features of our tool directly in your favorite terminal:

Add, remove and modify new platforms;
Perform impact analysis on your schema changes;
Support for dbt models;
Bulk apply (and remove) tags;
Analyze upstream and downstream column-level data lineage of your assets;
Add and remove lineage for your assets;
View usage statistics of your columns, tables and dashboards.

But hey, seeing is believing. So let me show you, in my humble opinion, some of the coolest features for analyzing and managing your metadata in the terminal.

Regression Testing

As I mentioned before: once you get used to the terminal, you never want to leave. And well, we get you.

Regression test for dropping a table, with the impacted entities listed by platform.

Whenever you need to drop or change a column or a table, test your SQL and reveal any downstream breaking changes, without leaving your terminal.

Support for dbt models

The regression testing can be used for tables, columns and BI elements. But also for dbt models!

When working with them, you can use the CLI to run tests and get a nice report of what you are going to break (hopefully nothing).

Bulk applying (and removing) tags

Want to apply the same tag to different entities at once? The terminal is your friend. You can apply tags to entities, or delete tags from them, in a batch based on keywords and rules!

On the Alvin side, we’ll need a few arguments, but based on the input all the matching entities will have the new tag bulk applied to or deleted from them.

Tag batch apply

Let’s say I want to apply a tag to all the column entities on the platform “bigquery” that exactly match the rule text “first_name” in the domain “name”. The tag I want to bulk apply is “cli_demo”, with tag type business_term and classification type “pii” (if you don’t set one, the classification type is “default”).

Once the command executes successfully, you can go to the UI and check out any entity matching the rule and you’ll see the new tag applied to it.

Tag Batch Delete

Similar to applying tags, you can also bulk delete them from entities based on the rule text and other parameters.

You’ll be asked for confirmation before the bulk delete operation is applied, but once you confirm, the tag is deleted from all the matching entities right away. Pretty neat.

Usage Statistics

Need to know if a specific entity is being used, how many times it was accessed, and by whom? Yeah, you can see that in the CLI too.

Below is an example of usage statistics from the past 30 days of a column called office. You are also able to see the usage count by user:

Support for tabular, YAML and JSON formats

Get your data in your preferred format!

For lovers of the classic tabular format.

For all of us json lovers out there.

I don’t miss XML at all, but whatever floats your boat.

And you can not only view your data in CSV, JSON and YAML, but also save the output in those formats!

Sample data in tabular format saved in a CSV.

Want to try it out?

If you are already an Alvin user, the documentation to install and use the CLI is here in our docs.

If not, how about signing up for a free trial or booking a live demo? :)

Data Governance from your Terminal was originally published in Alvin on Medium.

23 Life Lessons on the day I turn 23

Yashika Sharma — Thu, 05 Jan 2023 15:36:57 GMT

Today I turned 23 and I decided to take a moment to reflect on the things I have learned so far. Life never stops teaching you lessons and there are tons of things you learn every day, but these are the 23 things I found worth mentioning.

Photo by Daniel Huniewicz on Unsplash

1. Count your blessings, always be grateful for what you have, every day.

2. Don’t skip your meals.

3. Kindness is important. To everyone and anyone, including animals, everything that breathes deserves respect and kindness.

4. Your only competition is you. Don’t compare yourself with others.

5. Progress is not always linear or fast paced.

6. One thing that I have learned from my mother is to wake up every day and thank God for life before starting the day.

7. People come and go. Sometimes even blood relatives won’t be there for you, and that’s okay; holding on to expectations is only going to let you down.

8. Patience is extremely important, it might not work today but it might tomorrow.

9. Don’t say things when you’re angry; you will almost certainly regret it later.

10. Asking for too many opinions will confuse you, trust your instincts.

11. Do things for others without expecting anything in return. You are doing it because you want to, not because you want something back.

12. Leave your nest. Stepping out of your comfort zone is the only way to long-term growth.

13. Home is where your heart belongs; a ceiling and a few closed walls can’t be called home, it’s just a living space.

14. You don’t have to be an early bird to get work done, I’ve been nocturnal for my whole life.

15. Your university doesn’t matter as long as you have the hunger to learn and work hard on your own.

16. Trying something new means there is a chance of failing, but not trying at all means you have already lost the chance of success.

17. Maturity doesn’t come with age, it comes with experiences.

18. Learning everything at once is a bad idea, learning while doing is the best way.

19. “Don’t make perfect the enemy of good”. Things might not be perfect but can still be good enough to go forward with.

20. I love startups more than corporate roles. I thrive in a fast-paced environment.

21. Always look for opportunities to learn. It doesn’t have to be technical or related to your field; the world is full of interesting things to spark your curiosity.

22. Health is as important as wealth: a healthy person can work towards wealth, but wealth cannot buy your health back.

23. Embrace every experience you have, good or bad; it happens for a purpose. Falling is part of the process, but learning from it and moving forward is what doers do.

I plan to write my reflections on the year going forward, but for now I’ll rush to eat pizza and spend time watching my favourite shows :)

Acknowledge Your Privileges

Yashika Sharma — Wed, 04 Jan 2023 15:01:45 GMT

Humans have a habit of complaining and constantly seeking better things. It doesn’t matter if you are at the start of something or you’ve progressed and come a long way; the feeling of not having enough is very common and likely to appear.

Photo by Jukan Tateisi on Unsplash

While one should always aim for higher goals and progress throughout life, a sense of satisfaction and an awareness of one’s privileges and advantages are extremely important. Not only does this help with self-fulfilment, it also gives you a chance to step out of your own little bubble and see how people in different parts of the world are struggling.

Photo by Ann on Unsplash

It’s easy to take things for granted and not pay enough attention to, or feel gratitude for, the things you already have in your life; they get overlooked in the race to get more.

Now, if you are wondering what these privileges are that are so obvious yet overlooked every day, let me give you some examples:

Do you have a plate full of food for all meals in a day? Acknowledge your privilege: millions of people are struggling to put bread on the table and losing their lives to hunger.

Photo by Hennie Stander on Unsplash

Complaining about having a small house, even though you have a decent bed to sleep in and a place to live? Think about those who are homeless.

Photo by Ev on Unsplash

Have a peaceful environment where you don’t have to fear for your life? People are still in the middle of wars.

Photo by Jordy Meow on Unsplash

Got into a college and don’t like the university ranking? So many children still do not have the right to education.

Photo by Yannis H on Unsplash

Tired of listening to your parents giving you advice for your own betterment? Some people never get a chance to see their mom and dad, and some lose them too soon.

Photo on Unsplash

These are just some very obvious things that we have in front of our eyes, and yet we don’t appreciate them enough. There are countless things that you might have in your life that thousands of people are still struggling to have even once.

What might seem normal to you can be a huge privilege that someone else is struggling and wishing for, day and night.

Photo by Aaron Burden on Unsplash

Before taking anything for granted, think again about how it might be a privilege of yours. It will help with self-reflection and satisfaction, and gratitude will help you appreciate it in your life even more. :)

Write Robust APIs In Python With Three Layer Architecture, FastAPI and Pydantic Models

Yashika Sharma — Fri, 18 Nov 2022 10:22:42 GMT

GitHub is the ultimate pool of projects for software engineers. You can find plenty of ideas and implementations for almost anything that you can think of. While there are some amazing projects to take inspiration from, there are plenty of bad examples too.

Working on a software engineering project is more than just coding. Many people make the mistake of rushing directly to coding the idea in an unstructured way and skipping all the steps in between.

Photo by Chris Ried on Unsplash

Today we are going to talk about how to write more structured APIs by following a three-layer software engineering architecture. Splitting a project into layers helps with abstraction and gives you a more manageable structure. Another advantage is that if you want to change something in a particular layer, for example the database connection or a framework, you can do that in that individual layer without affecting the other layers.

The three layers of API Architecture

As you can see above, the three layers are:

API Layer
Service Layer
Database Layer

While setting up the project, a structure like the one below can be a good starting point.

Initial project setup in PyCharm

Above, you can see we have directories set up for each layer, with an additional directory called schemas which will hold our Pydantic models. (We’ll talk about this in detail.)

Towards the end, we’ll look at a hypothetical example of building the backend for a website which involves entities like users, orders and items. (To keep it short, we are not going to code the example fully; this post is more focused on explaining the structure and frameworks.)

API Layer

This is the topmost layer in the backend architecture and the one the user can directly interact with. As a good practice, no business logic should be present inside the API layer. Ideally, the API layer should interact with the service layer via Meta classes.

This layer should be the simplest, and deal with CRUD operations. We can also keep different versions inside the API layer, like below.

Different API versions

One layer that we did not mention before is the schema layer, which is a sub-layer of the interface layer. Broadly speaking, both the API layer and the schemas can be considered under the umbrella of the interface layer.

Good Python backend schema logic would have custom classes defined with the data schema and types for validation. We will talk more about this when we talk about Pydantic models.

Schemas for different entities

Service Layer

Most of the heavy lifting is done within this layer. All the business logic that shouldn’t be exposed to the user should go inside the service layer. The service layer acts as an intermediate layer between the API layer and the database layer. The logic and mapping are done within the service layer so that a request never fires a query against the database directly; the service layer sends the processed data to the DB layer instead. As a good practice, we can break down the service layer into meta and implementation directories. As mentioned before, it’s better to use Meta classes holding abstract methods (which can be implemented inside the corresponding service implementation class) to interact with the API endpoints.
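To make the meta and implementation split a bit more concrete, here is a minimal sketch using only the standard library. The names User, UserService and the fetch_user callable are assumptions made for illustration, and a plain dictionary stands in for the database layer; the real project may be organised differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class User:
    email_id: str
    name: str


class UserServiceMeta(ABC):
    """Interface the API layer talks to; it holds no logic of its own."""

    @abstractmethod
    def get_user(self, email_id: str) -> Optional[User]:
        """Return the user with this email, or None if there is none."""


class UserService(UserServiceMeta):
    """Implementation: business rules live here, storage is delegated."""

    def __init__(self, fetch_user: Callable[[str], Optional[User]]) -> None:
        # fetch_user stands in for a call into the database layer.
        self._fetch_user = fetch_user

    def get_user(self, email_id: str) -> Optional[User]:
        # Assumed business rule: emails are matched case-insensitively.
        normalized = email_id.strip().lower()
        return self._fetch_user(normalized)


# A dictionary plays the role of the database layer in this sketch.
_fake_db = {"ada@example.com": User(email_id="ada@example.com", name="Ada")}
service: UserServiceMeta = UserService(_fake_db.get)
```

Because the API layer only depends on UserServiceMeta, the implementation (or the storage behind it) can be swapped without touching the endpoints.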

Database Layer

The DB layer, or database layer, is where all of the data ingestion, modification and deletion logic lives. It contains database connectors and models. Before working on any project, a good practice is to write a High Level Design document with the expected data model defined. Especially when you work on big projects and the task involves modifying the existing data model, a proposal document goes a long way.

The database layer accepts the processed data from the service layer and performs queries and operations to interact with the database. It then returns Response objects that are passed through the service layer and eventually to the API layer.

The DB layer will also hold data models, which can be created using, for example, sqlalchemy, along with the files for query operations. In this case we use postgres as an example.

Database layer containing the models and postgres directories

Similar to the Service layer, the database layer can also be divided into meta and impl for more granularity.

FastAPI

FastAPI Github Repository

That was about setting up the structure; let’s talk about FastAPI now. FastAPI is an amazing web framework for creating APIs with Python, based on type hints. It also comes in very handy when working with the layer-based structure that we just talked about.

FastAPI has multiple advantages. While it can be used for web development, I mostly find it extremely useful for building APIs.

The official documentation lists the following key features of FastAPI:

Key Features of FastAPI

The FastAPI documentation is very detailed, but in a nutshell, it’s a very fast framework built on top of Starlette and Pydantic, usually served with Uvicorn and based on the OpenAPI standard, with native async support. With FastAPI you can write your API function parameters with Python 3.6+ type declarations and get automatic data conversion, data validation, OpenAPI schemas (with JSON Schemas) and interactive API documentation UIs.

For example, you can define the data schema and types using the Pydantic library, which makes sure request and response payloads are validated properly. You can also see documentation for your APIs without any extra effort using Swagger UI, and you can extend and improve it by writing docstrings in the API endpoint function itself.

One might ask why we should use FastAPI for REST API creation when we already have frameworks like Django REST framework and Flask-RESTful.

Sure we do, but if you have tried to create an API with Django before, you’d know that the process is similar to creating a web application because you have to use the Django application model. By comparison, FastAPI is far better and easier to maintain.

Performance-wise, too, FastAPI outperforms Django and Flask and is on par with NodeJS and Go.

https://www.techempower.com/benchmarks/

Pydantic

If you know Python types well, Pydantic models are easy to understand as well.

The Pydantic library enforces type hints at runtime, in the sense that you can create classes with attributes and use Pydantic to define types for them. You can also define default values while creating Pydantic classes (also called Pydantic models).

Some of the advantages of using Pydantic are:

Editor support, linting and autocomplete work well. Running a lint script will help you catch data type and schema validation errors while working on the project itself.
Faster than other similar libraries.
Validating data, especially with recursive, complex Pydantic models, is easy.
Converts data types of attributes automatically wherever applicable. This means you can pass the same object you get from a request directly to the database, as everything is validated automatically and similarly from the database directly to the client.

Now that we have learned more about the architecture and frameworks, let’s return to our earlier example and see some FastAPI endpoints and Pydantic model examples in code.

Below is the schema prepared for the entity User. As you can see, we have created three classes inherited from Pydantic’s BaseModel, with attributes and their expected types. These classes can be used to validate the request and response objects for endpoints related to User.

Pydantic models for entity User
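As a rough idea of what such a schema file could look like, here is a minimal sketch. The class names UserBase, UserCreate and UserResponse and all of their fields are assumptions made for illustration; they are not the exact models from the project.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class UserBase(BaseModel):
    email_id: str
    name: str
    age: Optional[int] = None  # default used when the field is omitted


class UserCreate(UserBase):
    # Fields only needed when a user is created.
    password: str


class UserResponse(UserBase):
    # Fields only present in what the API sends back.
    user_id: int


# Values are validated and, where possible, converted to the declared type.
user = UserResponse(user_id="7", email_id="ada@example.com", name="Ada")

try:
    UserCreate(email_id="ada@example.com", name="Ada")  # password is missing
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
```

Here the string "7" is converted to an integer, while the missing password is rejected with a ValidationError before any endpoint logic runs.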

We can create endpoints for different entities structured like this and route them under v1 using router.py

Routing APIs under router.py

An example endpoint for getting user details based on email_id is shown below.

Get user details based on email_id

As you can see, we are calling the get_user function from the Service Meta Layer (UserServiceMeta in this case).

get_user abstract method in Meta Service class

This method can be fully implemented inside the user_service.py file in the service implementation, which in turn can call the final function from the postgres_user.py file in the database layer, which performs the read and write operations against the database.

postgres_user.py in database layer
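To show the shape of a database-layer class without needing a running server, the sketch below uses the standard library’s sqlite3 in place of postgres. The UserRepository name, the table layout and the method names are assumptions made for illustration only.

```python
import sqlite3
from typing import Optional, Tuple


class UserRepository:
    """Database-layer class: the only place that issues SQL."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            "email_id TEXT PRIMARY KEY, name TEXT NOT NULL)"
        )

    def add_user(self, email_id: str, name: str) -> None:
        # Parameterised queries keep user input out of the SQL text.
        self._conn.execute(
            "INSERT INTO users (email_id, name) VALUES (?, ?)", (email_id, name)
        )
        self._conn.commit()

    def get_user(self, email_id: str) -> Optional[Tuple[str, str]]:
        # Returns an (email_id, name) row, or None when nothing matches.
        return self._conn.execute(
            "SELECT email_id, name FROM users WHERE email_id = ?", (email_id,)
        ).fetchone()


repo = UserRepository(sqlite3.connect(":memory:"))
repo.add_user("ada@example.com", "Ada")
```

The service implementation would call repo.get_user and map the returned row to a response object, so the SQL never leaks into the upper layers.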

One final thing in this section is the model inside the database layer. In this example I am using sqlalchemy as the ORM to define the User model for the database.

After implementing all the layers and connecting things together, the next step is to test the endpoints. I usually do that with Postman (this is out of scope for this post).

You can also see your API documentation using Swagger UI. Below is an example of a GET endpoint from the official documentation.

API documentation on Swagger UI

Bonus Part

Now that we have covered the primary focus of the post, I want to quickly talk about Typer.

If you want to build a CLI along with the FastAPI project, Typer can be a good option to consider. Both FastAPI and Typer were created by Sebastián Ramírez. In his own words:

Typer is the FastAPI of CLIs

Typer is based on Click (another tool for building CLIs), so you get all its benefits, plug-ins, robustness, etc., as well as Rich (a Python library for rich text and beautiful formatting in the terminal).

Commands like typer.secho output beautiful text in the terminal and give you options to choose the colors as well.

Example output from the documentation.
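For a feel of how little code a Typer command needs, here is a minimal sketch. The greet command, its name argument and the --shout flag are made up for illustration; typer.secho and typer.colors are part of Typer itself.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def greet(name: str, shout: bool = False) -> None:
    """Greet NAME; pass --shout for upper case."""
    message = f"Hello {name}"
    if shout:
        message = message.upper()
    # secho = styled echo: prints the message in bold green.
    typer.secho(message, fg=typer.colors.GREEN, bold=True)


# CliRunner invokes the app in-process, which is handy for tests.
result = CliRunner().invoke(app, ["Ada", "--shout"])
```

In a real project you would end the file with app() under an `if __name__ == "__main__":` guard and run it from the terminal instead of using CliRunner.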

I have built CLIs with Typer for projects with a FastAPI backend in the past. Some of the advantages I found in using Typer are:

It allows type hinting
Easy to use
Fast and short: you don’t have to write a lot of code to implement quick commands.
Very well-written documentation

Unlike FastAPI, Typer doesn’t use Pydantic directly as of now, but there have been discussions on GitHub such as this one requesting the support. Perhaps someday we’ll see Pydantic in Typer as well.

Diving deep into “How to build CLIs with Typer” would be a topic for another post.

Thank you for reading!


Be The Bigger One

Yashika Sharma — Tue, 02 Aug 2022 00:02:20 GMT

In life, as long as you are alive, there will be multiple instances of people lying to you, hurting you or harming you. Most of the time it’s mental pain and suffering, which makes you think about things from a completely different perspective.

Photo by Fred Moon on Unsplash

When you’re close to someone, you put all your faith in them, being assured they will never harm you but it doesn’t mean the other person has the same intentions.

Some of these incidents will leave long-lasting scars on your memory and will shape the biases with which you react to situations, judge people and make decisions.

While all of this is not something that can be easily controlled, one thing that can be done is to “Be the bigger one”.

As long as you hold on to the feeling of what happened to you, it’s going to hurt even more.

These things sound very philosophical but when you can’t control anything, one thing you can control is “how you react to it”.

When forgetting is not an option, forgiving is. And if that’s too difficult, then acknowledging what happened and moving on is the only thing you can do in the present for your future self to thank you.

It takes guts, maturity and self-awareness to be the bigger one. It’s important to accept how you’re feeling; even if it’s bad, acknowledge it. Ignoring it is a temporary fix, if at all.

It takes multiple significant experiences to realise this, but once you become the bigger one, moving on will be easier. It might not seem like it in the moment, but the future will bring you the assurance that you made the right decisions in the present.

PS: This is not a post break up draft, this is just a 3 am thought!

Why what’s working for them might not work for you?

Yashika Sharma — Tue, 31 May 2022 21:02:22 GMT

The grass is greener on the other side(these are eggs but one of them is looking at the grass)

The motivation behind this post is the feelings I had, and sometimes still have. I have to constantly remind myself about what I want and ask whether doing things that I might not want to do would really make me happy.

Now, these are of course just my thoughts and people might disagree; there might be edge cases, or your story or background might be different, and I totally respect that. This is just what I’ve experienced and what I’ve seen people around me feeling.

With that disclaimer, let’s start imagining two cases, two loops that you are entering into, not because you want to but because they look great from the outside.

Loop 1

People enter loops only because they look good from the outside, but the world inside is entirely different (and one you will probably dislike a lot).

Imagine you are an aspiring undergraduate, you love learning new skills, working with communities and building good hackathon projects.

You are looking at a classmate of yours bagging a very good, well-paying internship at a FAANG (or MAANG now). You aren’t jealous of them (or you might be), but you are feeling bad for yourself and thinking about why you didn’t get it in the first place.

You start the LeetCode grind, solving questions out of the woods, with no clear goal in mind, just the flames within chanting “you didn’t get the damn FAANG internship”. You do it until you feel burned out and leave it midway.

Now you are back to what you enjoy, maybe open source contributions, maybe some development or something else until you find another “pleasing loop” for yourself.

Loop 2

I did it before and now I am going to do it again. Loop 2, here I come…

A few days later, while scrolling YouTube, you see some videos with millions of views. Guess what? It’s time for the next “pleasing loop”.

You watch someone creating YouTube videos with high-quality content… you watch more… think more and decide to stalk more.

You see they have thousands of followers on Twitter, LinkedIn and every possible platform you can think of.

Your mind to you:

“Why couldn’t I do this? I should have started posting content regularly, these concepts are the same stuff I learned. But… I think I might not have gotten so many views and followers but… I should have done this. Let me try this now. I’ll create YouTube videos now!”

You enter the second endless loop of following what pleases you but is not something that you love or wanted to do yourself.

You again waste a couple of days, weeks or months in the loop and leave it midway with nothing.

Why are you doing what others are doing? Time to retrospect!

Enough of loops, can we retrospect?

Because …….

You don’t know what you want to do(yet).
You know what you want to do but you don’t have the courage to pursue it (thinking about the ‘what ifs’).
You are very jealous of the person you are stalking (it’s normal but also very harmful for your inner peace).
You think their path might be your path too and what worked for them will work for you as well. (No, this is not how it works)

BUT… The question is… Is that what you really want?

Raise your hand to answer(to yourself)

Taking the two loops above that you were stuck in for months as an example, ask yourself these questions:

Loop 1 (the dream of bagging a big-company internship):

1. Is getting a FAANG job your ultimate goal?

2. Is this going to help you achieve what you want? (That is, if you know what your goals are; if you don’t, it’s fine.)

3. Is it the work at FAANG or just the (so-called) status symbol? Do you want this for yourself, or for your relatives, who would eventually say a few good words to your face but in reality don’t care at all?

Loop 2 (the short-term passion of gaining followers):

1. Is your ultimate goal getting famous?

2. Does being around people make you happy?

3. Is this something that you’ll do for months and never really get bored of?

If the answer to most of the questions above is “No” or “I don’t think so”, then, my friend, what’s working for them might not work for you.

WHY?

Let the part time inspired passions run away, focus on what YOU want?

Because that is not what you want. Those things that look fancy to you right now might be something that you think will make you happy, but that’s not the truth.

Even if you work until you burn out and match the pace of what the other person is doing, it’s not going to help.

At some point, you’ll almost certainly try to break out of the loop again, since you entered it with no real interest; this was bound to happen.

Takeaway

In case the loop examples didn’t seem very specific to your case, the overall idea is still the same. Often, people enter endless loops just because things are working for others. There’s always more to the story. What you are able to see is only half the truth. Everyone has put in their hustle and hard work before getting to where they are right now. And if the end goal is not what you actually want, you’ll not be able to progress in any case.

The path is built from motivation and passion, and if you don’t have these then you are just trying to follow in someone else’s footsteps. While this might work sometimes, most of the time you’ll find yourself stuck in the loop.

So always think before blindly running after anything. Your own dreams and passions deserve your energy.

This post is only meant to make you realise that what they are doing is not what you want. And even if it is what you want, the exact same path might not work.

I hope this contributes a little to helping you find answers to whatever you are looking for. I love hearing different perspectives; if you’d like to chat or just share your views, feel free to comment or reach out to me on Twitter.

Decoding Word2Vec Part by Part

Yashika Sharma — Sat, 20 Jun 2020 13:31:46 GMT


I was reading about Word2Vec and realized that there is a lot of important information about it, but it is all scattered. I had to read several blogs before completely understanding the logic. If you are like me and love to supplement reading the paper with blogs, this is the right place.

What is Word2Vec?

If you are exploring NLP, you must have come across this term. Let me first introduce some other topics to build the intuition.

If you want to do exciting things with text, including sentiment analysis, question answering, text similarity, topic modeling, and whatnot, you are going to feed words/sentences/documents (corpus 😍) to the model.

But the machine doesn’t understand text, right? So you have to convert it into a machine-readable form.

For that:

The word vector is the thing you are looking for.

A word vector is the representation of a word in vector form. Well, you could just one-hot encode the words and match the encoded vectors, but that won’t serve the purpose. Imagine having 1 million words in the vocabulary and then encoding all those words to pass to the network. Why not learn vectors that encode similarities instead?

Word representations in vector space (2-D)

Word representations in vector space (3-D)

Crazy computation! Not possible. (Even if it is, why would I kill my machine with heavy math and 0’s?)

What before Word2Vec?

Before Word2Vec, N-grams were used to capture the meaning of a word given the n words that precede it, but N-grams cannot capture the wider context.

The overall probability would be P(current_word|n_words).
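To make that probability concrete, here is a minimal sketch of a count-based bigram estimate (n = 1 preceding word). The toy sentence and the function name bigram_prob are made up for illustration, and no smoothing is applied.

```python
from collections import Counter

# Toy corpus, made up for this sketch
tokens = "the cat sat on the mat the cat ran".split()

# Count each (previous_word, current_word) pair and each previous word
pair_counts = Counter(zip(tokens, tokens[1:]))
prev_counts = Counter(tokens[:-1])

def bigram_prob(prev, cur):
    """Estimate P(cur | prev) from raw counts (no smoothing)."""
    if prev_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / prev_counts[prev]

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" 2 times out of 3
print(bigram_prob("the", "mat"))
```

Notice that the estimate only looks at the word immediately before the current one, which is exactly why it struggles with wider context.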

To capture similarities we need embeddings. Don’t just cluster words; seek a representation that can capture the degree of similarity.

The context consists of the words in a fixed window around the word in the text.

What’s the measurement of quality in Word2Vec?

It’s how well the vectors capture the similarity of words in a word similarity task.

Advantages:

Computationally efficient
Accurate
Performs well on finding the semantic and syntactic similarity

Okay, but how do these WORD VECTORS WORK?

Each word is encoded as a vector (a point represented by numbers in multiple dimensions) so that it can be matched with the vectors of words that appear in a similar context. Hence a dense vector is formed for each word. The vectors are based on the features.

Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.

See how related words cluster around ‘expect’; it’s because those word vectors are similar.

Word embeddings capture the meanings of words with the help of vectors. Adding and subtracting the vectors of some words gives rise to meaningful relationships.

For example:
King - Man + Woman ≈ Queen
(V_King - V_Man + V_Woman ≈ V_Queen, where V = vector)
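Here is a small sketch of that arithmetic. The 2-D vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender") and are not real Word2Vec output. The query words themselves are excluded from the candidates, which is the usual convention for analogy lookups.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
vectors = {
    "king": np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man": np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.5, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank every word except the three query words by similarity to the target
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)
```

With real embeddings the result is only approximately equal to the target vector, so the nearest neighbour is taken as the answer.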

With word embeddings we use a fully connected layer whose weights are called embedding weights (their values are learned while training the model, just like the weights of other layers such as Dense, CNN, etc.).

And this embedding weight matrix turns out to be a lookup table.

Without word embedding, you’d one-hot encode the text and then multiply that with the hidden layer. For 200 words, you would one-hot encode each word and then multiply it with the hidden layer weights. Most of that multiplication is just multiplying by 0’s. How inefficient is that?

Word embedding comes to the rescue. Since multiplying any one-hot encoded vector with the weight matrix returns the corresponding row itself, we just assign a unique integer to each word and then take the corresponding row from the lookup table. Thus there’s no need to multiply.

Multiplying any OHE vector gives just the corresponding row.
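To see why the multiplication can be skipped, here is a minimal NumPy sketch. The sizes (5 words, 3 dimensions), the random weights, and the variable names are all made up for illustration.

```python
import numpy as np

# Hypothetical sizes chosen for illustration only
vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))  # the lookup table

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding    # full matrix multiplication
via_lookup = embedding[word_index]  # plain row lookup, no multiplication

print(np.allclose(via_matmul, via_lookup))  # True
```

Both routes give the same vector, but the lookup touches a single row instead of the whole matrix.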

So an embedding layer is a layer having embedding weights that are learned during training.

From Udacity’s Deep Learning Nanodegree

Say there is a word ‘heart’ whose index is 958. The 958th row of the embedding matrix will then be the output, which is passed forward to the hidden layer.

These weights in the lookup table are just the vector representations of words. The columns of the matrix represent the embedding dimensions. Words with similar meanings have similar representations.

Finally, let’s discuss Word2Vec

The Word2Vec model uses this concept of embedding and lookup. Based on the word of interest and its context, it learns the weights that make up the matrix. This learned matrix is the embedding, which captures the similarity between words.

Words that appear in similar contexts have similar representations. Word2Vec finds these similarities and relationships during training and hence prepares a master vector representation called the embedding.

Similar words are near each other and dissimilar words are far apart in the representation.

This is all about Word Vectors, Embeddings, and Word2Vec. Keep an eye out for understanding SkipGram in part-2. Show your love by clapping 👏.

Drop your questions in the comments or reach out on Twitter.


Decoding Word2Vec Part by Part was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.

Understanding Count Vectorizer

Yashika Sharma — Thu, 21 May 2020 13:47:58 GMT

Whenever we work on any NLP-related problem, we process a lot of textual data. After processing, this textual data needs to be fed into the model.

Since the model doesn’t accept textual data and only understands numbers, this data needs to be vectorized.


What do I mean by vectorized?

Before we use text for modeling, we need to process it. The steps include removing stop words, lemmatizing, stemming, tokenization, and vectorization. Vectorization is the process of converting text data into a machine-readable form. The words are represented as vectors.

However, our main focus in this article is on Count Vectorizer. Let’s get started by understanding the Bag of Words model:

Bag of Words(BoW)


As already mentioned, we cannot process text directly, so we need to convert it into numbers. The Bag of Words (BoW) model is a fundamental (and old) way of doing this.

The model is very simple: it discards all information about the order and structure of the text and considers only the occurrences of each word. It converts each document to a fixed-length vector of numbers.

A unique number is assigned to each word. Each document then becomes a vector as long as the vocabulary (the collection of all the unique words), holding the frequency of each word. This is the encoding of the words, in which we focus on the representation of the words and not on their order.
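The idea can be sketched by hand in a few lines of plain Python before reaching for a library. The two toy documents and the helper name bag_of_words are made up for this sketch.

```python
from collections import Counter

# Toy documents, made up for this sketch
docs = ["the cat sat", "the cat saw the dog"]

# Vocabulary: every unique word, sorted so each word gets a stable position
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the word counts of doc, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Both vectors have the same length, and shuffling the words inside a document would not change them.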

There are multiple ways with which we can define what this ‘encoding’ would be. Our focus in this post is on Count Vectorizer.

Count Vectorizer:

CountVectorizer tokenizes the text (tokenization means splitting sentences into words) and performs very basic preprocessing: it removes punctuation marks and converts all words to lowercase.

A vocabulary of known words is formed, which is also used for encoding unseen text later.

An encoded vector is returned with the length of the entire vocabulary and an integer count of the number of times each word appeared in the document. The image below shows what I mean by the encoded vector.

Count Vectorizer sparse matrix representation of words. (a) is how you visually think about it. (b) is how it is really represented in practice.

Each row of the above matrix represents a document, and the columns contain all the unique words with their frequencies. If a word does not occur in a document, the corresponding cell in that document’s row is zero.

Think of it like a one-hot encoded vector: because of that, it is no surprise that we get a sparse matrix with a lot of zeros.

The scikit-learn library offers a class that implements Count Vectorizer. Let’s check out the code examples.

Examples

In the code block below we have a list of texts. Here each entry is a document. We are keeping it short to see how Count Vectorizer works.

First things first, let’s do the import. Also, look at document, the list of documents we are going to process:

from sklearn.feature_extraction.text import CountVectorizer

document=["devastating social and economic consequences of COVID-19",
"investment and initiatives already ongoing around the world to expedite deployment of innovative COVID-19",
"We commit to the shared aim of equitable global access to innovative tools for COVID-19 for all",
"We ask the global community and political leaders to support this landmark collaboration, and for donors",
"In the fight against COVID-19, no one should be left behind"]

The second step is to initialize the object cv_doc for using Count Vectorizer and fit it on our document:

#the documents go to fit(), not to the constructor (its parameters are keyword-only in recent scikit-learn)
cv_doc=CountVectorizer()

vocab=cv_doc.fit(document)

Calling fit preprocesses and tokenizes the text (word-level tokenization means each word is a separate token) and learns the vocabulary; transform or fit_transform then returns the counts as a sparse matrix. The best part is that, by default, it ignores single-character tokens such as I and a.

This is what our vocab looks like.

To see the complete vocabulary we can write vocab.vocabulary_.

Note that the numbers here are not the counts; they are the positions in the sparse matrix.
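A tiny example makes the difference between positions and counts easier to see. The two toy documents below are made up for this sketch and are not the ones used in the article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for this sketch
docs = ["the cat sat", "the cat saw the dog"]

cv = CountVectorizer()
matrix = cv.fit_transform(docs)  # SciPy sparse matrix, one row per document

# vocabulary_ maps each word to its column position, not to a count
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(cv.vocabulary_["the"])  # 4, the column that holds the counts for "the"
print(columns)                # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix.toarray())       # rows: [1 0 1 0 1] and [1 1 0 1 2]
```

The actual counts only show up in the matrix: "the" sits in column 4 and appears twice in the second document.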

Further, there are some additional parameters you can play with.

1. Stop words: You can pass a stop_words list as an argument. Stop words are words that occur frequently and carry little meaning. For example, ‘the’, ‘and’, ‘is’, and ‘in’ are stop words. The list can be custom or predefined.

Define your own list of stop words that you don’t want to see in your vocabulary.

cv1=CountVectorizer(stop_words=['the','we','should','this','to'])

#check out the stop_words you specified
cv1.stop_words

2. min_df: min_df is a number that specifies how much importance you want to give to the less frequent words in the documents. There might be some words that appear only once or twice and may qualify as noise.

What does min_df do?

With min_df=2, only words that are present in at least 2 documents are considered. We can also pass a proportion instead of an absolute number.

For example, min_df=0.25 ignores words that are present in less than 25% of the documents.

cv2=CountVectorizer(min_df=2)

3. max_df: Similar to min_df, there is max_df, which indicates the importance you want to give to the most frequent words. There might be some words that are very frequent and that you don’t want to include in your vocab; in that case, max_df is used.

It’s the opposite of min_df and keeps only words that are present in at most the specified number of documents.

Let’s test the proportion instead of the absolute number here. If words are present in more than 25% of the documents, they are ignored.

cv3=CountVectorizer(max_df=0.25)
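The two document-frequency filters are easiest to compare side by side on a tiny corpus. The three toy documents and the helper vocab_for are made up for this sketch; max_df=0.5 is used instead of 0.25 only because the corpus has just three documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for this sketch
docs = ["the cat sat", "the cat saw the dog", "a dog barked"]

def vocab_for(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    cv = CountVectorizer(**params)
    cv.fit(docs)
    return sorted(cv.vocabulary_)

# min_df=2 keeps only words found in at least 2 documents
print(vocab_for(min_df=2))    # ['cat', 'dog', 'the']

# max_df=0.5 drops words found in more than 50% of the documents
print(vocab_for(max_df=0.5))  # ['barked', 'sat', 'saw']
```

On this corpus the two filters happen to split the vocabulary into complementary halves: the shared words survive min_df, the rare ones survive max_df.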

4. Tokenizer: If you want to specify a custom tokenizer, you can create a function and pass it to the count vectorizer during initialization. We have used the NLTK library to tokenize our text.

def tok(text):
    #add your code here; it must return a list of tokens
    return text.split() #placeholder so the function runs

cv4=CountVectorizer(tokenizer=tok)

5. Custom Preprocessing: The same goes for preprocessing. If you want to include a stemmer or lemmatizer when preprocessing the text, you can define a custom function just like we did for the tokenizer. Although our data is clean in this post, real-world data is very messy, and if you want to clean it along with Count Vectorizer, you can pass your custom preprocessor as an argument. Keeping the example simple, we are just lowercasing the text and then removing special characters.

import re

def preprocess(text):
    #one possible version: lowercase, then replace special characters with spaces
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

cv5=CountVectorizer(preprocessor=preprocess)

6. n-grams: Combinations of words are sometimes more meaningful. Let’s say we have the words ‘sunny’ and ‘day’; ‘sunny day’ combined makes more sense. This is a bigram. We can use character-level as well as word-level n-grams. ngram_range=(1,2) specifies that we want to consider both unigrams (single words) and bigrams (combinations of 2 words).

cv6=CountVectorizer(ngram_range=(1,2))
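Here is a quick sketch of what ends up in the vocabulary with ngram_range=(1,2). The one-sentence corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(["sunny day today"])  # toy one-sentence corpus

# Both single words and adjacent word pairs become features
features = sorted(cv.vocabulary_)
print(features)  # ['day', 'day today', 'sunny', 'sunny day', 'today']
```

Three words produce five features here, so keep in mind that the vocabulary grows quickly as the n-gram range widens.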

7. Limiting vocabulary size: We can set the maximum vocabulary size we intend to keep using max_features. In this example we are going to limit the vocabulary size to 20.

cv7=CountVectorizer(max_features=20)
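When max_features is set, scikit-learn keeps the terms with the highest total counts across the corpus. A small sketch on a made-up two-document corpus, using max_features=2 so the effect is visible:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for this sketch
docs = ["the cat sat", "the cat saw the dog"]

cv = CountVectorizer(max_features=2)
counts = cv.fit_transform(docs).toarray()

# Only the two most frequent terms across the corpus survive
kept = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(kept)    # ['cat', 'the']
print(counts)  # rows: [1 1] and [1 2]
```

"the" (3 occurrences) and "cat" (2 occurrences) are kept, and every other word is dropped from the matrix.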

Phew! That’s all for now. CountVectorizer is just one of many methods to deal with textual data. TF-IDF and embeddings are often better ways to vectorize the data. More on that later.

To access the code used in this article, check out the repository here.


Understanding Count Vectorizer was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.

EmoTorch

Yashika Sharma — Mon, 16 Mar 2020 22:43:04 GMT

Sample from FER2013 Dataset with labels

EmoTorch is a project built as part of the Facebook AI Hackathon 2020 using PyTorch. The project aims at predicting the emotion of a person based on an image of their face. The image can be anything from a selfie to an image captured via the mobile’s front camera or a webcam while scrolling the feed.

PyTorch is the main tool used in this project because of its diverse modules and packages.

The image is fed to the neural network, which extracts features from it, analyzes the emotions, and predicts the most likely emotion out of the 7 most common emotions.

Most Likely predicted classes

This article explains the model; the motivation behind the idea and the future scope will be covered in the next article.

The EmoTorch repository is self-explanatory to a great extent, but we will walk through the project from the top here.

Choosing the Dataset

https://datarepository.wolframcloud.com/resources/FER-2013

The first step in any project is to choose a dataset. We chose the publicly available FER dataset for our task. The reasons behind choosing this dataset:

It has images categorized into one of seven emotions.
It is publicly available.
The size of the dataset is suitable for our task, with:
Training set: 28,709 examples
Test set: 3,589 examples
Validation set: 3,589 examples

The data was pulled from a past Kaggle Competition.

There are two files available. The first file, train.csv, contains two columns, “emotion” and “pixels”. The “emotion” column contains a numeric code ranging from 0 to 6, inclusive, for the emotion that is present in the image. The “pixels” column contains a string surrounded in quotes for each image. The contents of this string are space-separated pixel values in row-major order. test.csv contains only the “pixels” column, and our task is to predict the emotion column.
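Turning one “pixels” string back into an image is a short reshape. The sketch below uses a made-up 2x3 string so the result is easy to check by eye; real FER2013 images are 48x48 grayscale, so the same code would use height = width = 48 there. The variable names are illustrative and not taken from the EmoTorch repository.

```python
import numpy as np

# A made-up 2x3 "image" stands in for a real 48x48 FER2013 row
pixels = "0 64 128 255 32 16"
height, width = 2, 3

# Split on spaces, convert to integers, then reshape in row-major order (NumPy's default)
values = [int(p) for p in pixels.split()]
image = np.array(values, dtype=np.uint8).reshape(height, width)

print(image)
# [[  0  64 128]
#  [255  32  16]]
```

Row-major order means the first `width` values fill the first row, the next `width` values fill the second row, and so on.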

The emotions available are as follows

{'0': 'angry',
 '1': 'disgust',
 '2': 'fear',
 '3': 'happy',
 '4': 'neutral',
 '5': 'sad',
 '6': 'surprise'}

However, the model is built to work with any dataset. We could use commercial datasets with our model and possibly get better results.

After choosing the dataset, we preprocessed the images. The images were already center-cropped with ready-to-use dimensions. We chose not to resize them and went with just normalizing the images and converting them to tensors.

Data Augmentation

Network Architecture

The motivation for using transfer learning came after we implemented a deep neural network from scratch. The model built from scratch gave an accuracy of only around 18%-20%. We boosted the accuracy with the help of transfer learning.

PyTorch’s torchvision.models subpackage has a variety of pre-trained networks that can be easily downloaded.

For EmoTorch, we tried multiple networks before settling on VGG19.

Initially, we used VGG16, which gave us an accuracy below 40%, followed by ResNet50 with 41% and DenseNet101 with 42.5%.

VGG19 yielded an accuracy of 46%, which is better than all the other pre-trained models we tried.

Therefore, we decided to choose VGG19 for the implementation.

VGG19 Architecture

Model & Hyperparameters

The pre-trained models are trained on the ImageNet dataset, which has 1000 classes. Our task is only to classify the images into one of the 7 emotions, so we had to alter the classification layer. We prepared our own network to merge with the VGG19 pre-trained layers.

For this task we chose:

1024-unit dense hidden layer
ReLU activation function
Dropout layers in between the hidden layers with p=0.2
Adam optimizer
25 epochs
Batch size of 64
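The choices above can be sketched as a small PyTorch classifier head. This is an illustration under assumptions, not the code from the EmoTorch repository: the article does not give the exact layer stack, so a single 1024-unit hidden layer is assumed, and the input size of 25088 is the flattened output of torchvision’s VGG19 feature extractor (512 x 7 x 7). A random tensor stands in for real features so that no pre-trained weights need to be downloaded.

```python
import torch
from torch import nn

NUM_EMOTIONS = 7
VGG19_FEATURES = 25088  # 512 * 7 * 7, flattened output of torchvision's VGG19 features

# Assumed layout: one 1024-unit hidden layer (the article does not spell out the exact stack)
classifier = nn.Sequential(
    nn.Linear(VGG19_FEATURES, 1024),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(1024, NUM_EMOTIONS),
)

# Adam with default settings; the learning rate is not stated in the article
optimizer = torch.optim.Adam(classifier.parameters())

# Random stand-in for one batch of 64 flattened feature vectors
batch = torch.randn(64, VGG19_FEATURES)
logits = classifier(batch)
print(logits.shape)  # torch.Size([64, 7])
```

With a real torchvision VGG19, a head like this would typically replace the model’s classifier attribute while the pre-trained feature layers stay frozen.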

The Imbalanced Dataset

The distribution of samples per category in the FER dataset is not balanced. The category disgust is the least represented with only 547 samples, whereas the category happiness is the most represented with 8989 samples.

The future scope lies in augmentation. Multiple balancing techniques can be used to present an equal number of samples per category, which may result in higher accuracy.
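As one illustration of a balancing technique, here is a sketch of inverse-frequency class weighting. The article does not say which technique the team had in mind, so treat this as an example rather than the project’s method. Only the two sample counts quoted above are used; the counts for the other five classes are not given in the article and are left out.

```python
# Only these two counts are stated in the article; the other five classes are omitted
sample_counts = {"disgust": 547, "happy": 8989}

total = sum(sample_counts.values())
num_classes = len(sample_counts)

# Inverse-frequency weighting: rarer classes get proportionally larger weights
class_weights = {
    label: total / (num_classes * count)
    for label, count in sample_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.2f}")
```

In PyTorch, weights like these could be passed as the weight argument of nn.CrossEntropyLoss so that mistakes on rare classes cost more during training.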

Data Distribution

Accuracy

The model gives an accuracy of 46%, which we attribute to the dataset we used. A large commercial dataset with high-resolution pictures could give better accuracy.

Some of the plots we made with TensorBoard:

Training Loss

Validation Loss

Valid Accuracy

A Few Examples from the Testing Set

Top 3 predicted classes

All class probabilities

What’s Next?

EmoTorch can be combined with recommendation systems. An image passed to our model returns the predicted emotion. This emotion can be used by the system to recommend products.

Often we see recommendation systems working based on a user’s watch history or buying history. EmoTorch gives real-time predictions that could help make recommendations more accurate. Users can either feed a selfie to the system, or the front camera can track their facial expressions with their consent. In either case, the image will then be processed by EmoTorch, and the prediction will be used by the system to recommend songs to listen to, movies to watch, products to buy, places to visit, and much more.

Contributors to EmoTorch are:

Yashika Sharma
Nathan Curtis
Ahmed Hamido

Visit the project repository.