Stories by Sriram on Medium

Monorepos: Building with Nx: An ExpressJS and NextJS Application

Sriram — Wed, 21 Dec 2022 05:27:44 GMT

Nx: Next.js | Express

What are Monorepos?

Monorepo: Not a monolith

The textbook definition of a monorepo is “A monorepo is a single repository containing multiple distinct projects, with well-defined relationships.” That sounds an awful lot like monolithic architecture, which is really hated in the world of “microservices”. Nothing against monoliths, considering Instagram is one of the largest monoliths out there. Still, we have all seen the cons of a monolithic architecture. But wait:

✋ Monorepo ≠ Monolith

A good monorepo is the opposite of monolithic! Read more about this and other misconceptions in the article on “Misconceptions about Monorepos: Monorepo != Monolith”.

TLDR:

Everything at that current commit works together.
Changes can be verified across all affected parts of the organization.
Easy to split code into composable modules
Easier dependency management
One toolchain setup
Code editors and IDEs are “workspace” aware
Consistent developer experience

Source: https://monorepo.tools/#why-a-monorepo

Is Google still using monorepo?

Believe it or not, Google is one of the biggest monorepo users in our industry. I’m not making this up: their code base is huge, as you can probably imagine, and reports state that 95% of it lives inside the same repository. I rest my case 🎯

In all seriousness, I’m just documenting my learning for a project at work. Anyways #LearningInPublic.

Creating a simple api to Fetch Cricket Players 🏏

Let’s start by creating a simple Express app with a REST controller that fetches data from our dataset.

First, initialize a repository using:

npx create-nx-workspace --preset=express

This opens an interactive prompt and initializes the repo. I used repository name = nx-cricket and application name = nx-cricket-api (if you wish to follow along to a T).

nx serve nx-cricket-api # to test the app

Directory Structure 😅

So far so good

I downloaded the dataset from https://data.world/raghav333/cricket-players-espn, cleaned it, and wrote a simple Python script to convert it into JSON, which makes it easier to export as a TypeScript array. Note: the entire code and dataset is available on GitHub.

export interface Cricketer {
    Id: string;
    Name: string;
    Country: string;
    'Full name': string;
    Age: string;
    'Major teams': string;
    'Batting style': string;
    'Bowling style': string;
    Other: string;
}

export const cricketers: Cricketer[] = [
    {
        "Id": "0",
        "Name": "Henry Arkell",
        "Country": "England",
        "Full name": "Henry John Denham Arkell",
        "Age": "84",
        "Major teams": "Northamptonshire",
        "Batting style": "Right-hand bat",
        "Bowling style": "",
        "Other": ""
    }, ...
]
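The Python conversion script itself is not shown here. As a rough illustration of the same idea in TypeScript (not the script actually used), the sketch below turns CSV text into Cricketer objects. It assumes a header row whose column names match the interface keys and no quoted commas inside fields; csvToCricketers and sampleCsv are hypothetical names.

```typescript
// Shape of one record, copied from the interface above.
interface Cricketer {
  Id: string;
  Name: string;
  Country: string;
  'Full name': string;
  Age: string;
  'Major teams': string;
  'Batting style': string;
  'Bowling style': string;
  Other: string;
}

// Assumption: simple CSV with a header row and no quoted commas in any field.
function csvToCricketers(csv: string): Cricketer[] {
  const lines = csv.trim().split('\n');
  const headers = (lines[0] ?? '').split(',').map((h) => h.trim());
  return lines.slice(1).map((row) => {
    const cells = row.split(',');
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings, like in the dataset.
      record[header] = (cells[i] ?? '').trim();
    });
    return record as unknown as Cricketer;
  });
}

// Hypothetical sample: a header plus the one row shown in this post.
const sampleCsv = [
  'Id,Name,Country,Full name,Age,Major teams,Batting style,Bowling style,Other',
  '0,Henry Arkell,England,Henry John Denham Arkell,84,Northamptonshire,Right-hand bat,,',
].join('\n');

const parsed = csvToCricketers(sampleCsv);
```

A real export with commas inside a field (for example several teams in one cell) would need a proper CSV parser instead of split(',').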

Define a Simple Controller in nx-cricket-api>src>main.ts

Let’s define two simple controllers: one to retrieve all the cricketers, and a query endpoint to return cricketers by name.

// Returns all the cricketers 
app.get('/cricketers', (_, res) => {
  res.send({cricketers});
});

// Returns cricketer: Lazy Search by Name
app.get('/search', (req, res) => {
  const q = ((req.query.q as string) ?? '').toLocaleLowerCase();
  res.send(cricketers.filter( ({ Name }) => 
    Name.toLocaleLowerCase().includes(q)
  ));
});

localhost:3333/cricketers

localhost:3333/search?q=Barb
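To make the /search behaviour easy to check without starting the server, the filtering step can be pulled out into a plain function. This is a minimal sketch, not part of the generated app: searchByName and the sample rows are hypothetical.

```typescript
// Case-insensitive substring match on Name, mirroring the /search handler.
// An undefined query behaves like an empty string and matches everything.
function searchByName<T extends { Name: string }>(
  players: T[],
  query: string | undefined
): T[] {
  const q = (query ?? '').toLocaleLowerCase();
  return players.filter(({ Name }) => Name.toLocaleLowerCase().includes(q));
}

// Hypothetical rows; only the first one comes from the dataset excerpt above.
const samplePlayers = [
  { Id: '0', Name: 'Henry Arkell', Country: 'England' },
  { Id: '1', Name: 'Placeholder Barber', Country: 'Nowhere' },
];

const barbMatches = searchByName(samplePlayers, 'Barb');
const allPlayers = searchByName(samplePlayers, undefined);
```

Inside the Express handler this could be called as searchByName(cricketers, req.query.q as string), which keeps the route itself a one-liner.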

Let’s create a simple frontend with Next.js

First we have to install @nrwl/next if it has not already been installed. Then we create a simple Next.js app, for which Nx generates the boilerplate. Finally, we add a simple page that renders the list of cricket players.

# Installing nrwl/next

# yarn
yarn add --dev @nrwl/next

# npm
npm install --save-dev @nrwl/next

# create nx app (answers given at the interactive prompts are shown as comments)
nx g @nrwl/next:app
# name = nx-cricket-search
# style = css

Api, Frontend and Test folders generated

Removed all the boilerplate and added this basic frontend code

import React, { useCallback, useEffect, useState } from 'react';

import { Cricketer } from '@nx-cricket/shared-types';

export function Index() {
  const [search, setSearch] = useState('');
  const [cricketer, setCricketer] = useState<Cricketer[]>([]);

  // Re-query the API every time the search text changes
  useEffect(() => {
    fetch(`http://localhost:3333/search?q=${encodeURIComponent(search)}`)
      .then((resp) => resp.json())
      .then((data) => setCricketer(data));
  }, [search]);

  const onSetSearch = useCallback(
    (evt: React.ChangeEvent<HTMLInputElement>) => {
      setSearch(evt.target.value);
    },
    []
  );

  return (
    <div>
      <input
        style={{ padding: '10px', margin: '20px' }}
        value={search}
        placeholder="Enter Cricketer Name"
        onChange={onSetSearch}
      />
      <ul>
        {cricketer.map(({ Id, Name, Country, Age }) => (
          <li key={Id}>
            {Name}, {Country}, {Age}
          </li>
        ))}
      </ul>
    </div>
  );
}

export default Index;

Lazy Search :)

You can also add server-side rendering. But for this tutorial I’ve decided to keep it simple.

Shared types

One of the best features of monorepos is the ability to use shared types. If you look at the frontend code, the cricketer array is strongly typed. This is possible because we moved the Cricketer interface out of cricketer.ts and into the shared-types library, from which both the API and the frontend can import it. This is one of the most useful features of the monorepo structure.

# Creating a shared Library
nx g @nrwl/node:lib shared-types

# In libs>shared-types>src>index.ts, add the interface already defined in our cricketer.ts
export interface Cricketer {
    Id: string;
    Name: string;
    Country: string;
    'Full name': string;
    Age: string;
    'Major teams': string;
    'Batting style': string;
    'Bowling style': string;
    Other: string;
}

# Clean up cricketer.ts: import the type instead of declaring it there
import type { Cricketer } from "@nx-cricket/shared-types";

export const cricketers: Cricketer[] = [
    {
        "Id": "0",
        "Name": "Henry Arkell",
        "Country": "England",
        "Full name": "Henry John Denham Arkell",
        "Age": "84",
        "Major teams": "Northamptonshire",
        "Batting style": "Right-hand bat",
        "Bowling style": "",
        "Other": ""
    },...
]

# import type { Cricketer } from "@nx-cricket/shared-types" now works from any project in the workspace
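One caveat: an interface only exists at compile time, so data arriving from fetch is not checked against it. A minimal sketch of a runtime guard that could live next to the shared interface is shown below. isCricketer and parseCricketers are hypothetical helpers, not part of the original project, and the interface is repeated only to keep the snippet self-contained.

```typescript
interface Cricketer {
  Id: string;
  Name: string;
  Country: string;
  'Full name': string;
  Age: string;
  'Major teams': string;
  'Batting style': string;
  'Bowling style': string;
  Other: string;
}

// Every key of the interface; all of them are expected to hold strings.
const CRICKETER_KEYS = [
  'Id', 'Name', 'Country', 'Full name', 'Age',
  'Major teams', 'Batting style', 'Bowling style', 'Other',
] as const;

// Type guard: true only when every expected key is present as a string.
function isCricketer(value: unknown): value is Cricketer {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return CRICKETER_KEYS.every((key) => typeof record[key] === 'string');
}

// Keep only well-formed entries from an untyped API response.
function parseCricketers(data: unknown): Cricketer[] {
  return Array.isArray(data) ? data.filter(isCricketer) : [];
}

const validRow = {
  Id: '0', Name: 'Henry Arkell', Country: 'England',
  'Full name': 'Henry John Denham Arkell', Age: '84',
  'Major teams': 'Northamptonshire', 'Batting style': 'Right-hand bat',
  'Bowling style': '', Other: '',
};
const checked = parseCricketers([validRow, { Name: 'incomplete' }, null, 42]);
```

In the frontend, the second .then could pass its data through parseCricketers before calling setCricketer, so malformed rows never reach the render loop.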

Dependency Graph for our Project: generated with nx graph

Crypto Tax: Blessing in Disguise?

Sriram — Sat, 05 Feb 2022 09:32:44 GMT

In the Union Budget of 2022, the government of India announced that it would be launching a digital rupee and would start taxing income from virtual digital assets. A 30 per cent tax on any income from the transfer of virtual digital assets to be precise. The term “virtual-digital” asset might seem redundant in usage as something that is “virtual” (at least with the technology available now) by definition has to be functionally digital. Defining terminologies in the crypto space has always been one of the most challenging tasks for the govt. In late 2021 there were many ambiguities with the definitions of cryptocurrencies or crypto assets, which started an Indian Crypto Ban hoax, causing unnecessary chaos, which probably led to the definition of the term “Virtual Digital Assets”.

In the explanatory memorandum of the Finance Bill, the government stated, “To define the term “virtual digital asset”, a new clause (47A) is proposed to be inserted to section 2 of the Act. As per the proposed new clause, a virtual digital asset is proposed to mean any information or code or number or token (not being Indian currency or any foreign currency), generated through cryptographic means or otherwise, by whatever name called, providing a digital representation of value which is exchanged with or without consideration, with the promise or representation of having inherent value, or functions as a store of value or a unit of account and includes its use in any financial transaction or investment, but not limited to, investment schemes and can be transferred, stored or traded electronically. Non-fungible Token and any other token of similar nature are included in the definition.”

Now that the definition of virtual digital assets has unequivocally been clarified (sarcasm), what does the “monstrous” tax on crypto assets mean to an average citizen? Purely from a policy-making perspective, I find large taxes on certain commodities to be counterproductive if they do not cause citizens to intrinsically reject being associated with that commodity. We have seen this to be the case with alcohol, tobacco, etc. And to say that India has accepted blockchain as the future of good governance and then impose a huge tax on crypto assets might seem hypocritical at first glance. But is that the case?

As someone passionate about DeFi (decentralized finance), blockchain technology and Web3, I have never quite understood why cryptocurrencies are so volatile, given that the gas fee (transaction fee) for transacting on popular chains like Bitcoin and Ethereum is extremely high (for a reason!), especially for someone trying to make a living out of day-trading crypto assets. I find no other compelling reason than the fact that people/traders measure or predict the value of cryptocurrencies purely using the tools and techniques that have been applied to the equity and forex markets. So it is a simple question: are people investing in cryptocurrencies because they believe in the technology, or because they have a cliché algorithm running on their computers/mobile phones that predicts their investment will reap extraordinary benefits in the immediate future? The latter has led to many youngsters blindly investing in cryptocurrencies without understanding the consequences. Moreover, extensive advertising, rising NFT hype and our favourite Elon “The DogeFather” Musk phenomenon have played a persuasive role in driving crypto investments, further justifying the government’s concerns regarding crypto trading.

My friend’s silly solution to this taxation was to mine bitcoin and buy groceries from the dark web. As absurd as it may sound, he captured the essence of the technology (or at least some of it) more than many who make a living out of it. And this shows the bigger picture of the “Crypto Conundrum” in India. People who invest in crypto because they believe in the technology (“the damned hodlers”) and people who invest in crypto because they want to diversify their portfolio would be far less affected by this taxation than your average “Dogecoin to the Moon” Joes and “Rainbow Ape NFT Collector” Billys.

Finally, I would like to draw attention to the word “transfer” that I highlighted in the first paragraph. To put it naively, you only have to pay the tax when you make a capital gain, or, simply put, on the money you make on a profitable trade. When you have invested in the technology that facilitates the most secure transactions, why not transact using the same? Making day-trading profits on crypto is a classic example of failed “Indian Jugaad”.

Why use a banana as a stand for incense sticks when you can eat the damn fruit!

Thus I hope this law incentivises people to learn about crypto, read the whitepaper before investing in it and use it the way it was intended to be used.

Osintgram: The untold side of Instagram

Sriram — Thu, 13 May 2021 16:30:46 GMT

DISCLAIMER: This is not a “How to hack someone on Instagram” tutorial, but rather an awareness post on how people get scammed on the internet and how to protect yourself from getting hacked.

Firstly, I would like to make something clear. If you intend to hack someone, you’ve come to the wrong place. It is absolutely contemptible to hack someone without their consent, and with the tools available, it is highly unlikely that you would get away with it! I believe in the principles of transparency of data. The information that I post here is publicly available and anyone can access it. Unfortunately, getting hold of it and making it work is easier than you think, and I strongly believe that people should be aware of such scams. Having said all that, hope is not lost: there are straightforward measures to make sure you are safe from getting hacked.

If you are not into the technical details but just want to learn how to protect yourself from the attack go to the end of the post.

The Script goes as follows:

1. OSINTGRAM

Well to start things off, what is OSINT?

OSINT, or Open Source Intelligence, is a multi-method methodology for collecting, analyzing and making decisions about data accessible in publicly available sources, to be used in an intelligence context. In simpler words, it is publicly available information that can be used for data analysis, data collection, etc.

Osintgram is essentially a computer program that uses the Instagram API to gather information. On paper, there is nothing illegal about it, and it’s beautifully written code ( credits to the developers ). The more I think about it there are so many practical applications!

GitHub - Datalux/Osintgram: Osintgram is a OSINT tool on Instagram. It offers an interactive shell to perform analysis on Instagram account of any users by its nickname

The account is a dummy account created for educational purposes

Apart from flaunting the rather “typical hacker screen” terminal window, the developers have written simple yet efficient code in your favourite language, C++ (just kidding, it’s written in Python XD). Having gone through the code, I must say it is just simple Instagram API calls, with which you can gather the following information:

All registered addresses of target photos
Target’s photos captions
A list of all the comments on the target’s posts
Total comments of target’s posts
Target followers
Users followed by target
Email of target followers
Email of users followed by target
Phone number of target followers
Phone number of users followed by target
Hashtags used by the target
Total likes of target’s posts
Target’s posts type (photo or video)
Description of target’s photos
Download target’s photos in the output folder
Download target’s profile picture
Download target’s stories
List of users tagged by target
A list of users who commented on target’s photos
A list of users who tagged target

The saving grace is that all this information is accessible only if the account is public or the victim’s account is followed by the perpetrator. So, as a general rule of thumb, do not follow an account you have no clue about. As far as public accounts are concerned, this process is rather computationally intensive, and retrieving the information is close to impossible (at least for your everyday hacker who googled “how to hack someone on Instagram”). Also, information about the hacker’s account will be gathered at Instagram’s end.

The main scope of this article is complete, but for the sake of demonstrating how a typical script would be written, I’ll continue with a few more steps.

2. Blackeye

Blackeye is yet another social engineering tool that is publicly available on the internet. It allows anyone to host a dummy version of a well-known website to capture information like the username and password. This is a far more powerful tool; at the same time, it can be easily detected. The website has to be hosted somewhere (mostly on temporary platforms like ngrok, serveo etc.).

So as a general rule of thumb, never open anonymous-looking links, especially ones ending with .ngrok.

The original authors of the script have taken it down. Having said that, many modified versions of the OG Blackeye are still pretty easily accessible.

3. SET

SET (Social Engineering Toolkit) is a popular tool usually packaged with the default installation of Kali Linux (or any pen-testing distro for that matter). It is a Swiss Army knife for social engineering: essentially, it gives you a list of tools for basic social engineering. The tool was intended to simulate an actual phishing mail for typical white-hat (ethical) hackers.

So typically the perpetrator would create a dummy account and follow your Instagram account. He would then use Osintgram to extract information from your account (say, the email IDs of your followers) and send a string of spam emails to all your followers, with a link to a dummy website hosted using Blackeye. People who unknowingly log in through this link would compromise their credentials :(

Is all hope lost?

NO! This “scam”, like most of the prevailing scams, requires a lot of luck and a continuous series of careless moves by the victim. Many of these can be avoided with simple steps.

1. Use a spam filter!

All email services have spam filters. Here is a link to a step-by-step guide to adding spam filters:

https://www.rightinbox.com/blog/gmail-spam-filter

2. Do not click on unknown or suspicious links.

Most popular organisations host their links on their own servers, and it is highly likely that the domain name clearly contains the name of the organisation and the website. If the URL doesn't explicitly show that, avoid clicking on the link. To be doubly sure, I recommend using VirusTotal to check whether the website is safe to visit.
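For the technically inclined, the domain check described above can be sketched in a few lines of TypeScript. This is only an illustration under simple assumptions: isTrustedHost is a hypothetical helper, the list of trusted domains is something you would supply yourself, and a plain hostname comparison is no substitute for a scanner like VirusTotal.

```typescript
// Returns true when the URL's hostname is a trusted domain itself
// or a subdomain of it (e.g. www.instagram.com for instagram.com).
function isTrustedHost(rawUrl: string, trustedDomains: string[]): boolean {
  let hostname: string;
  try {
    hostname = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  return trustedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Assumed allow-list for this example.
const trusted = ['instagram.com'];

const genuine = isTrustedHost('https://www.instagram.com/accounts/login', trusted);
// The organisation's name appears in the host, but the real domain is ngrok.io.
const lookalike = isTrustedHost('https://instagram.com.example.ngrok.io/login', trusted);
```

The second URL shows why looking for the organisation's name anywhere in the link is not enough: only the end of the hostname decides who actually controls the site.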

3. Avoid accepting follow requests from suspicious or unknown accounts.

I must say it is quite difficult these days to make an anonymous account that doesn't get flagged almost instantly, and it is unlikely that the hacker would get away with it. Nevertheless, it is always better not to get hacked than to go through the whole process.

4. Get yourself educated! Do not be technology ignorant!

Do follow cybersecurity updates and the latest trends, at least the most popular ones.

How to setup a Pseudo-distributed Cluster with Hadoop 3.2.1 and Apache Spark 3.0

Sriram — Fri, 14 Aug 2020 13:04:09 GMT

https://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/

This post is an installation guide for Apache Hadoop 3.2.1 and Apache Spark 3.0 [latest stable versions], based on the assumption that you have used Big Data frameworks like Hadoop and Apache Spark before and want to try out the latest versions of the Hadoop and Spark environments for development purposes. Needless to say, I will cover the fundamentals of Apache Hadoop and Apache Spark.

Note: This installation is not meant to be used in a real-life / production environment. My next post will cover the setup of a multi-node cluster for a production environment.

What is the difference between Stand-Alone mode and pseudo-distributed mode?

Single Node (Local Mode or Standalone Mode)
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, where you don’t really use HDFS.
You can use the local file system for both input and output in standalone mode.

You also don’t need to do any custom configuration in the files mapred-site.xml, core-site.xml and hdfs-site.xml.

Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all input and output.

Pseudo-distributed Mode
The pseudo-distributed mode is also known as a single-node cluster where both NameNode and DataNode will reside on the same machine.

In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a configuration is mainly used for testing, when we don’t need to think about resources and other users sharing them.

In this architecture, a separate JVM is spawned for every Hadoop component, and the components communicate across network sockets, effectively producing a fully functioning and optimized mini-cluster on a single host.

So, in this mode, configuration changes are required in all three files: mapred-site.xml, core-site.xml and hdfs-site.xml.

HDFS and MapReduce:

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional computer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking. In short, HDFS gives us a base to store large datasets distributed among multiple nodes, and the MapReduce programming model gives us a faster and more efficient way to process and retrieve that data.
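To make the map, shuffle and reduce phases concrete, here is a tiny in-memory word count written in TypeScript. It is only a conceptual sketch: it runs in a single process, the two strings stand in for file blocks stored on different nodes, and none of the function names come from Hadoop's API.

```typescript
type KeyValue = [string, number];

// Map phase: emit (word, 1) for every word found in one input split.
function mapWords(split: string): KeyValue[] {
  return split
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 0)
    .map((word): KeyValue => [word, 1]);
}

// Shuffle phase: group all emitted values by their key.
function shuffle(pairs: KeyValue[]): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  pairs.forEach(([key, value]) => {
    const bucket = groups.get(key) ?? [];
    bucket.push(value);
    groups.set(key, bucket);
  });
  return groups;
}

// Reduce phase: collapse each key's values into a single total.
function reduceCounts(groups: Map<string, number[]>): Map<string, number> {
  const totals = new Map<string, number>();
  groups.forEach((values, key) => {
    totals.set(key, values.reduce((sum, v) => sum + v, 0));
  });
  return totals;
}

// Two "blocks" standing in for splits of one file.
const splits = ['hadoop stores blocks', 'spark reads hadoop blocks'];
const mapped = ([] as KeyValue[]).concat(...splits.map(mapWords));
const wordCounts = reduceCounts(shuffle(mapped));
```

In a real cluster each call to mapWords would run on the node that holds the block, which is the data locality mentioned above, and only the grouped pairs would travel over the network.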

Useful Resources:

Apache Hadoop 3.2.1

https://en.wikipedia.org/wiki/MapReduce

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN — a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications
Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. In this post we shall install Apache Spark along with Hadoop.

Installation of Hadoop:

Pre-req:

A Linux distribution (a VM should work fine, but it is not recommended)
Sudo privileges
A decent computer with a stable internet connection (only for downloading the necessary software)

Installation:

1. Install Java

sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
# this command is for an ubuntu system

2. See the Hadoop Wiki for known good versions. I used java version 8. Verify your installation using java -version.

(base) sriram@sriram-Inspiron-7572:~$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

3. To change the Java version used

(base) sriram@sriram-Inspiron-7572:~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).

Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
* 2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number:

I have moved my java-8-openjdk-amd64 directory to /usr/local/ (personal preference). I suggest you do the same for the sake of the tutorial, or note down the location carefully for $JAVA_HOME.

4. Add JAVA_HOME to ~/.bashrc

Note: bashrc is a very powerful file, and changes made to it can break your shell environment. Use the file carefully and make sure you don’t delete or add unnecessary lines. In this tutorial (and every tutorial) you’ll find that instructors suggest you use the nano / vi text editor. People from a pure Windows background might find these hard to use, hence I would recommend gedit/subl instead (just replace nano/vi with gedit).

$ nano ~/.bashrc #to open bashrc

Scroll to the end and paste these lines:

# JAVA VARIABLES
export JAVA_HOME=/usr/local/java-8-openjdk-amd64 
export PATH=$PATH:$JAVA_HOME/bin

Save and close (ctrl + s and ctrl + x for nano)

5. Install ssh. ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management. (On Ubuntu:)

(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install ssh
(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install pdsh

Make sure you add this line to your ~/.bashrc file:

# this line is to ensure pdsh uses ssh
export PDSH_RCMD_TYPE=ssh

6. Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Once you are done, it should look like this:

(base) sriram@sriram-Inspiron-7572:~$ ssh localhost
Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-42-generic x86_64)

* Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

* Are you ready for Kubernetes 1.19? It's nearly here! Try RC3 with
   sudo snap install microk8s --channel=1.19/candidate --classic

https://microk8s.io/ has docs and details.

2 updates can be installed immediately.
0 of these updates are security updates.
To see these additional updates run: apt list --upgradable

Your Hardware Enablement Stack (HWE) is supported until April 2025.
*** System restart required ***
Last login: Fri Aug 14 13:17:31 2020 from 127.0.0.1

7. Download and extract hadoop 3.2.1 software package in the location of your choice.

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz #DOWNLOAD
$ tar xzf hadoop-3.2.1.tar.gz #EXTRACT
$ mv hadoop-3.2.1 hadoop #rename
$ sudo mv hadoop /usr/local/ #writing to /usr/local usually needs sudo

You can also download the archive manually from the given link, extract the files and place them in any location. I placed Hadoop at /usr/local/

Apache Download Mirrors

8. Set Hadoop environment variables

Add these lines to your /etc/environment file:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

Add these lines of code to your bashrc file

# this line loads /etc/environment, which puts the Hadoop bin and sbin directories on PATH so that the "hadoop" command can be run from anywhere in the system (multi-environment)
source /etc/environment

export HADOOP_HOME=/usr/local/hadoop 
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
# this line points the JVM at Hadoop's native libraries in lib/native (this will not affect functionality but will improve performance); it is associated with the "Unable to load native-hadoop library" WARN.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin 
export HADOOP_INSTALL=$HADOOP_HOME

To update the variables run source ~/.bashrc

9. Edit Config Files

This is the most important section of the module. Follow the steps carefully.

Add these properties inside the <configuration> … </configuration> tags of the following files
(replace the existing empty <configuration> … </configuration> tags in each file).

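$HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmpdata</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>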

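$HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>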

$HADOOP_HOME/etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

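$HADOOP_HOME/etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>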
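Each of these files follows the same layout: a <configuration> root holding <property> entries, each with a <name> and a <value>. As a purely illustrative sketch (toHadoopXml is a hypothetical helper, not a Hadoop tool), the TypeScript below renders that layout from a key/value object, using the core-site.xml values from this step.

```typescript
// Escape the characters that would otherwise break the XML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Render a Hadoop-style <configuration> document from name/value pairs.
function toHadoopXml(properties: Record<string, string>): string {
  const body = Object.keys(properties)
    .map((name) =>
      [
        '  <property>',
        `    <name>${escapeXml(name)}</name>`,
        `    <value>${escapeXml(properties[name] ?? '')}</value>`,
        '  </property>',
      ].join('\n')
    )
    .join('\n');
  return `<configuration>\n${body}\n</configuration>`;
}

// The two core-site.xml settings used in this tutorial.
const coreSite = toHadoopXml({
  'hadoop.tmp.dir': '/usr/local/hadoop/tmpdata',
  'fs.default.name': 'hdfs://127.0.0.1:9000',
});
```

Printing coreSite gives text that can be compared by eye against the hand-edited file, which is a quick way to spot a missing closing tag.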

10. Edit hadoop-env.sh

The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

When setting up a single-node Hadoop cluster, you need to define which Java implementation is to be used. First, note the value of $JAVA_HOME:

(base) sriram@sriram-Inspiron-7572:~$ $JAVA_HOME
bash: /usr/local/java-8-openjdk-amd64: Is a directory

Here the value is /usr/local/java-8-openjdk-amd64. Next, use the previously created $HADOOP_HOME variable to open the hadoop-env.sh file:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the JAVA_HOME line (i.e., remove the # sign) and set it to the full path of the OpenJDK installation on your system. If you have followed the same setup as in the first part of this tutorial, add the following line:

export JAVA_HOME=/usr/local/java-8-openjdk-amd64

11. Format the file system

$ $HADOOP_HOME/bin/hdfs namenode -format

12. If everything has gone well till now, you should be able to see this, which means you have successfully installed the standalone version of Hadoop.

(base) sriram@sriram-Inspiron-7572:~$ hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

13. Verify the installation

(base) sriram@sriram-Inspiron-7572:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as sriram in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
localhost: datanode is running as process 33621.  Stop it first.
Starting secondary namenodes [sriram-Inspiron-7572]
sriram-Inspiron-7572: secondarynamenode is running as process 33832.  Stop it first.
Starting resourcemanager
Starting nodemanagers
(base) sriram@sriram-Inspiron-7572:~$ jps
35475 Jps
33621 DataNode
35111 NodeManager
33832 SecondaryNameNode
34954 ResourceManager

Name Node

Data Node

Resource Manager

PORTS: [localhost]

8088: Resource Manager

9870: Name Node

9864: Data Node

My bashrc file:

# this line loads /etc/environment, which puts the Hadoop bin and sbin directories on PATH so that the "hadoop" command can be run from anywhere in the system
source /etc/environment

# JAVA VARIABLES
export JAVA_HOME=/usr/local/java-8-openjdk-amd64 
export PATH=$PATH:$JAVA_HOME/bin

# HADOOP VARIABLES
export HADOOP_HOME=/usr/local/hadoop 
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
# this line points the JVM at Hadoop's native libraries in lib/native (this will not affect functionality but will improve performance); it is associated with the "Unable to load native-hadoop library" WARN.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin 
export HADOOP_INSTALL=$HADOOP_HOME

# this line is to ensure pdsh uses ssh
export PDSH_RCMD_TYPE=ssh

# SPARK VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

14. Test a Basic Command

HDFS Commands - GeeksforGeeks

// Guess what the code does ? // (answer at the end)

(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user/sriram
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ hdfs dfs -mkdir /input
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ hdfs dfs -put etc/hadoop/*.xml /input
2020-08-14 15:16:02,263 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,116 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,300 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,759 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,931 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,104 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,288 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,405 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,524 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
2020-08-14 15:18:41,134 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
2020-08-14 15:18:41,853 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001
2020-08-14 15:18:42,045 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:18:42,277 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:9000/user/sriram/input
 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:396)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
 at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
 at org.apache.hadoop.examples.Grep.run(Grep.java:78)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.examples.Grep.main(Grep.java:103)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
 at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -cat output/*
cat: `output/part-r-00000': No such file or directory
cat: `output/_SUCCESS': No such file or directory
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$
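
The InvalidInputException in the log is worth a second look: the files were uploaded to /input, but the job was given the relative path input, which the error message shows being resolved to /user/sriram/input. The sketch below models that resolution rule in plain Python. It is not Hadoop code, and resolve_hdfs_path is a name I made up for illustration.

```python
import posixpath


def resolve_hdfs_path(path: str, user: str) -> str:
    """Model how a path argument is resolved: absolute paths are kept,
    relative paths are taken relative to the user's home directory /user/<user>."""
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))


if __name__ == "__main__":
    print(resolve_hdfs_path("/input", "sriram"))  # where the .xml files were put
    print(resolve_hdfs_path("input", "sriram"))   # where the grep job looked
```

Passing /input to the job, or uploading to the relative path input in the first place, would make the two locations agree.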

The changes reflected in HDFS

(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ cat output/part-r-00000 
1 dfsadmin
1 dfs.replication

Code: [answer]

Made a directory, /input, on HDFS
hdfs dfs -put etc/hadoop/*.xml /input : puts all the .xml files into /input
Counted every string in those files that matches the regex dfs[a-z.]+ and wrote the counts to output
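
To make the last step concrete: the grep example works on the contents of the input files, counting each distinct string that matches the regular expression, which is why the output has lines like "1 dfsadmin". Here is a rough single-machine sketch of the same idea. The function grep_count and the sample lines are mine, purely for illustration, and the real job runs as MapReduce on the cluster.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count each distinct substring matching pattern across all lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        # "map" step: find every match in the line.
        for match in regex.findall(line):
            # "reduce" step: sum the occurrences per matched string.
            counts[match] += 1
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Made-up sample lines standing in for the .xml files.
    sample = ["<name>dfs.replication</name>", "run the dfsadmin command"]
    for text, count in grep_count(sample, r"dfs[a-z.]+"):
        print(count, text)
```

The order of tied entries in this sketch is my own choice and may differ from what Hadoop writes.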

11. Finishing things off

(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as sriram in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [sriram-Inspiron-7572]
Stopping nodemanagers
Stopping resourcemanager

Apache Spark

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You’ll find it used by organizations in every industry, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing frameworks, with 365,000 meetup members in 2017.


Apache Spark vs. Apache Hadoop

Outside of the differences in the design of Spark and Hadoop MapReduce, many organizations have found these big data frameworks to be complementary, using them together to solve a broader business challenge.

Hadoop is an open source framework that has the Hadoop Distributed File System (HDFS) as storage, YARN as a way of managing computing resources used by different applications, and an implementation of the MapReduce programming model as an execution engine. In a typical Hadoop implementation, different execution engines are also deployed such as Spark, Tez, and Presto.

Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others. Spark on Hadoop leverages YARN to share a common cluster and dataset as other Hadoop engines, ensuring consistent levels of service, and response.

In this post I will not dive deep into the Spark framework, but will give a quick installation guide.

Installation:

Apache Download Mirrors

1. Download the file from the above link and extract it to /usr/local/spark
2. Add the following lines to your bashrc (change the location if you extracted it somewhere else)

# Spark Variables
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

3. Add the following line to $SPARK_HOME/bin/load-spark-env.sh

export SPARK_LOCAL_IP="127.0.0.1"

4. Verify installation

start-all.sh # To start all hadoop-daemons
spark-shell --master yarn # start spark with YARN

2020-08-14 17:53:50,165 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-08-14 17:54:00,660 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_1597405003831_0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

msris108/BIG_DATA-PROJECTS

Check out my GitHub repo, which covers the basics of Spark and SparkML. More articles on Spark and SparkML will be posted soon.