<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Sriram on Medium]]></title>
        <description><![CDATA[Stories by Sriram on Medium]]></description>
        <link>https://medium.com/@msris108?source=rss-bba4fc74f7a5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*E358j_BQkSIb7bOdbMWJMg.png</url>
            <title>Stories by Sriram on Medium</title>
            <link>https://medium.com/@msris108?source=rss-bba4fc74f7a5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 05 Jun 2026 17:46:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@msris108/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Monorepos : Building with Nx: An ExpressJS and NextJS Application]]></title>
            <link>https://msris108.medium.com/monorepos-building-with-nx-a-expressjs-and-nextjs-application-d8246b859487?source=rss-bba4fc74f7a5------2</link>
            <guid isPermaLink="false">https://medium.com/p/d8246b859487</guid>
            <category><![CDATA[express]]></category>
            <category><![CDATA[web-development]]></category>
            <category><![CDATA[nextjs]]></category>
            <category><![CDATA[nx]]></category>
            <category><![CDATA[monorepo]]></category>
            <dc:creator><![CDATA[Sriram]]></dc:creator>
            <pubDate>Wed, 21 Dec 2022 05:27:44 GMT</pubDate>
            <atom:updated>2023-07-13T07:25:07.445Z</atom:updated>
            <content:encoded><![CDATA[<h3>Monorepos: Building with Nx: An ExpressJS and NextJS Application</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3dCo6KkqReWxd0ZK.png" /><figcaption>Nx: Next.js | Express</figcaption></figure><h3><strong>What are Monorepos?</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/0*y_0EgwzG84HQBYFW.png" /><figcaption>Monorepo: Not a monolith</figcaption></figure><p>The textbook definition is: “A monorepo is a single repository containing multiple distinct projects, with well-defined relationships.” That sounds an awful lot like a monolithic architecture, which is really hated in the world of “microservices”. Nothing against monoliths, considering Instagram is one of the largest monoliths out there. Still, we have all seen the cons of a monolithic architecture. But wait:</p><h4><strong>✋ Monorepo ≠ Monolith</strong></h4><p>A good monorepo is the opposite of monolithic! Read more about this and other misconceptions in the article on <a href="https://blog.nrwl.io/misconceptions-about-monorepos-monorepo-monolith-df1250d4b03c">“Misconceptions about Monorepos: Monorepo != Monolith”</a>.</p><p>TLDR:</p><ul><li>Everything at the current commit works together.</li><li>Changes can be verified across all affected parts of the organization.</li><li>Easy to split code into composable modules</li><li>Easier dependency management</li><li>One toolchain setup</li><li>Code editors and IDEs are “workspace” aware</li><li>Consistent developer experience</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/939/1*ZLbdRuP7daV-sgDA97agDQ.png" /><figcaption>Source: <a href="https://monorepo.tools/#why-a-monorepo">https://monorepo.tools/#why-a-monorepo</a></figcaption></figure><p><strong>Is Google still using a monorepo?</strong></p><p>Believe it or not, <strong><em>Google is one of the biggest monorepo users in our industry.</em></strong> I’m not making this up: their code base is huge, 
as you can probably imagine, and reports state that they have 95% of it inside the same repository. I rest my case 🎯</p><p>In all seriousness, I’m just documenting my learning for a project at work. Anyway, #LearningInPublic.</p><h3><strong>Creating a simple API to Fetch Cricket Players 🏏</strong></h3><p>Let’s start by creating a simple Express app with a REST controller that fetches the data from our dataset.</p><p>Let’s initialize a repository using:</p><blockquote>npx create-nx-workspace --preset=express</blockquote><blockquote>This opens an interactive shell and initializes the repo; repository name = nx-cricket; application name = nx-cricket-api (if you wish to follow along to the T)</blockquote><blockquote>nx serve nx-cricket-api # to test the app</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/260/1*q8Z19H2Bp0REJAb3Hr7wNg.png" /><figcaption>Directory Structure 😅</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aJkGDadPS8VfTtFUY9QpmQ.png" /><figcaption>So far so good</figcaption></figure><p>I downloaded the dataset <a href="https://data.world/raghav333/cricket-players-espn">https://data.world/raghav333/cricket-players-espn</a>, cleaned it, and wrote a simple Python script to convert it into JSON, which makes it easier to export as a TypeScript array. 
Note: the entire code and dataset are available on GitHub.</p><pre>export interface Cricketer {<br>    Id: string;<br>    Name: string;<br>    Country: string;<br>    &#39;Full name&#39;: string;<br>    Age: string;<br>    &#39;Major teams&#39;: string;<br>    &#39;Batting style&#39;: string;<br>    &#39;Bowling style&#39;: string;<br>    Other: string;<br>}<br><br>export const cricketers: Cricketer[] = [<br>    {<br>        &quot;Id&quot;: &quot;0&quot;,<br>        &quot;Name&quot;: &quot;Henry Arkell&quot;,<br>        &quot;Country&quot;: &quot;England&quot;,<br>        &quot;Full name&quot;: &quot;Henry John Denham Arkell&quot;,<br>        &quot;Age&quot;: &quot;84&quot;,<br>        &quot;Major teams&quot;: &quot;Northamptonshire&quot;,<br>        &quot;Batting style&quot;: &quot;Right-hand bat&quot;,<br>        &quot;Bowling style&quot;: &quot;&quot;,<br>        &quot;Other&quot;: &quot;&quot;<br>    }, ...<br>]</pre><p><strong>Define a Simple Controller in nx-cricket-api&gt;src&gt;main.ts</strong></p><p>Let’s define two simple route handlers: one that returns all the cricketers, and a query endpoint that returns the cricketers whose name matches a search string.</p><pre>// Returns all the cricketers<br>app.get(&#39;/cricketers&#39;, (_, res) =&gt; {<br>  res.send({cricketers});<br>});<br><br>// Returns cricketers: case-insensitive substring search by Name<br>app.get(&#39;/search&#39;, (req, res) =&gt; {<br>  const q = ((req.query.q as string) ?? 
&#39;&#39;).toLocaleLowerCase();<br>  res.send(cricketers.filter( ({ Name }) =&gt; <br>    Name.toLocaleLowerCase().includes(q)<br>  ));<br>});</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K9wbkpK3nf6d7C2O05PRzA.png" /><figcaption>localhost:3333/cricketers</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OBQjyLYIxC7BAmrtZdDRLg.png" /><figcaption>localhost:3333/search?q=Barb</figcaption></figure><p><strong>Let’s create a simple frontend with Next.js</strong></p><p>First, install @nrwl/next if it is not already installed. Then generate a Next.js app, which creates the usual boilerplate. On top of that, we add a simple page that renders the list of cricket players.</p><pre># Installing nrwl/next<br><br># yarn<br>yarn add --dev @nrwl/next<br><br># npm<br>npm install --save-dev @nrwl/next<br><br># create nx app<br>nx g @nrwl/next:app<br># answers to the interactive prompts:<br>#   name = nx-cricket-search<br>#   style = css</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/233/1*CqUCR35f-n1bb5e-PWzRgQ.png" /><figcaption>Api, Frontend and Test folders generated</figcaption></figure><p>I removed all the boilerplate and added this basic frontend code:</p><pre>import { useEffect, useState, useCallback } from &#39;react&#39;;<br>import React from &#39;react&#39;;<br><br>import { Cricketer } from &#39;@nx-cricket/shared-types&#39;;<br><br>export function Index() {<br>  const [search, setSearch] = useState(&#39;&#39;);<br>  const [cricketer, setCricketer] = useState&lt;Cricketer[]&gt;([]);<br><br>  // Re-query the API whenever the search text changes<br>  useEffect(() =&gt; {<br>    // encodeURIComponent replaces the deprecated escape()<br>    fetch(`http://localhost:3333/search?q=${encodeURIComponent(search)}`)<br>      .then((resp) =&gt; resp.json())<br>      .then((data) =&gt; setCricketer(data));<br>  }, [search]);<br><br>  const onSetSearch = useCallback(<br>    (evt: React.ChangeEvent&lt;HTMLInputElement&gt;) =&gt; {<br>      setSearch(evt.target.value);<br>    },<br>    []<br>  );<br><br>  return (<br>    &lt;div&gt;<br>      &lt;input<br>        
style={{ padding: &#39;10px&#39;, margin: &#39;20px&#39; }}<br>        value={search}<br>        placeholder=&quot;Enter Cricketer Name&quot;<br>        onChange={onSetSearch}<br>      /&gt;<br>      &lt;ul&gt;<br>        {cricketer.map(({ Id, Name, Country, Age }) =&gt; (<br>          &lt;li key={Id}&gt;<br>            {Name}, {Country}, {Age}<br>          &lt;/li&gt;<br>        ))}<br>      &lt;/ul&gt;<br>    &lt;/div&gt;<br>  );<br>}<br><br>export default Index;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/469/1*HMMRZMjYwERlZjT2zZSpKg.png" /><figcaption>Lazy Search :)</figcaption></figure><p>You can also add server-side rendering, but for this tutorial I’ve decided to keep it simple.</p><p><strong>Shared types</strong></p><p>Monorepos make it easy to share types across projects. If you look at the frontend code, the cricketer array is strongly typed. That works because we moved the Cricketer interface out of the API’s Cricketer.ts file and into a shared-types library that both projects can import. 
This is one of the most useful features of the monorepo structure.</p><pre># Creating a shared library<br>nx g @nrwl/node:lib shared-types<br><br># In libs&gt;shared-types&gt;src&gt;index.ts add | already found in our Cricketer.ts<br>export interface Cricketer {<br>    Id: string;<br>    Name: string;<br>    Country: string;<br>    &#39;Full name&#39;: string;<br>    Age: string;<br>    &#39;Major teams&#39;: string;<br>    &#39;Batting style&#39;: string;<br>    &#39;Bowling style&#39;: string;<br>    Other: string;<br>}<br><br># Cleanup cricketer.ts<br>import type { Cricketer } from &quot;@nx-cricket/shared-types&quot;<br><br>export const cricketers: Cricketer[] = [<br>    {<br>        &quot;Id&quot;: &quot;0&quot;,<br>        &quot;Name&quot;: &quot;Henry Arkell&quot;,<br>        &quot;Country&quot;: &quot;England&quot;,<br>        &quot;Full name&quot;: &quot;Henry John Denham Arkell&quot;,<br>        &quot;Age&quot;: &quot;84&quot;,<br>        &quot;Major teams&quot;: &quot;Northamptonshire&quot;,<br>        &quot;Batting style&quot;: &quot;Right-hand bat&quot;,<br>        &quot;Bowling style&quot;: &quot;&quot;,<br>        &quot;Other&quot;: &quot;&quot;<br>    },...<br>]<br><br># type { Cricketer } from &quot;@nx-cricket/shared-types&quot; can be accessed from any project</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EZTecMkwBXGYl7HV7TOJxw.png" /><figcaption>Dependency Graph for our Project: generated with nx graph</figcaption></figure>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crypto Tax: Blessing in Disguise?]]></title>
            <link>https://msris108.medium.com/crypto-tax-blessing-in-disguise-5b1b1513f9ca?source=rss-bba4fc74f7a5------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b1b1513f9ca</guid>
            <category><![CDATA[capital-gains]]></category>
            <category><![CDATA[union-budget-2021-india]]></category>
            <category><![CDATA[crypto-tax-bill]]></category>
            <category><![CDATA[cryptocurrency]]></category>
            <category><![CDATA[blockchain]]></category>
            <dc:creator><![CDATA[Sriram]]></dc:creator>
            <pubDate>Sat, 05 Feb 2022 09:32:44 GMT</pubDate>
            <atom:updated>2022-02-05T09:32:44.584Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zVhC2XCyPBcdTNOs2vGKyA.jpeg" /></figure><p>In the Union Budget of 2022, the government of India announced that it would launch a digital rupee and start taxing income from virtual digital assets: a 30 per cent tax on any income from the <strong>transfer</strong> of virtual digital assets, to be precise. The term “virtual digital asset” might seem redundant, as something that is “virtual” (at least with the technology available now) by definition has to be functionally digital. Defining terminology in the crypto space has always been one of the most challenging tasks for the government. In late 2021, ambiguities in the definitions of cryptocurrencies and crypto assets started an Indian Crypto Ban hoax and caused unnecessary chaos, which probably led to the definition of the term “Virtual Digital Assets”.</p><p>In the explanatory memorandum of the Finance Bill, the government stated, “To define the term “virtual digital asset”, a new clause (47A) is proposed to be inserted to section 2 of the Act. As per the proposed new clause, a virtual digital asset is proposed to mean any information or code or number or token (not being Indian currency or any foreign currency), generated through cryptographic means or otherwise, by whatever name called, providing a digital representation of value which is exchanged with or without consideration, with the promise or representation of having inherent value, or functions as a store of value or a unit of account and includes its use in any financial transaction or investment, but not limited to, investment schemes and can be transferred, stored or traded electronically. 
Non-fungible Token and any other token of similar nature are included in the definition.”</p><p>Now that the definition of virtual digital assets has unequivocally been clarified (sarcasm), what does the “monstrous” tax on crypto assets mean to an average citizen? Purely from a policy-making perspective, I find large taxes on certain commodities to be counterproductive if they do not cause an intrinsic rejection among citizens of being associated with that particular commodity. We have seen this to be the case with alcohol, tobacco etc. And to say that India has accepted blockchain as the future of good governance and then impose a huge tax on crypto assets might seem hypocritical at first glance. But is that the case?</p><p>As someone passionate about de-fi (decentralized finance), blockchain technology and Web3, I have never quite understood why cryptocurrencies are so volatile, given that the gas fee (transaction fee) on popular chains like Bitcoin and Ethereum is extremely high (for a reason!), especially for someone trying to make a living out of day-trading crypto assets. The only compelling explanation I can find is that people/traders measure or predict the value of cryptocurrencies purely using the tools and techniques that have been applied to the equity and forex markets. So it comes down to a simple question: are people investing in cryptocurrencies because they believe in the technology, or because they have a cliché algorithm running on their computers/mobile phones predicting that their investment will reap extraordinary benefits in the immediate future? The latter has led to many youngsters blindly investing in cryptocurrencies without understanding the consequences. Moreover, extensive advertising, rising NFT hype and our favourite Elon “The DogeFather” Musk phenomenon have played a persuasive role in driving crypto investments. 
All of this further justifies the government’s concerns regarding crypto trading.</p><p>My friend’s silly solution to this taxation was to mine bitcoin and buy groceries from the dark web. As absurd as it may sound, he captured the essence of the technology (or at least some of it) better than many who make a living out of it. And this shows the bigger picture of the “Crypto Conundrum” in India. People who invest in crypto because they believe in the technology (“the damned hodlers”) and people who invest in crypto because they want to diversify their portfolio would be far less affected by this taxation than your average “Dogecoin to the Moon” Joes and “Rainbow Ape NFT Collector” Billys.</p><p>Finally, I would like to draw attention to the word “transfer” that I highlighted in the first paragraph. To put it naively, you only have to pay the tax when you make a capital gain, that is, on the money you make from a profitable trade. When you have invested in the technology that facilitates the most secure transactions, why not transact using the same! Making day trading profits on crypto is a classic example of failed “Indian Jugaad”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/581/1*yqP00Q7qv5zF_fOWkCqQHQ.jpeg" /><figcaption>Why use a banana as a stand for incense sticks when you can eat the damn fruit!</figcaption></figure><p>Thus I hope this law incentivises people to learn about crypto, read the whitepaper before investing in it and use it the way it was intended to be used.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Osintgram: The untold side of Instagram]]></title>
            <link>https://msris108.medium.com/osintgram-the-untold-side-of-instagram-3a7537c75bfd?source=rss-bba4fc74f7a5------2</link>
            <guid isPermaLink="false">https://medium.com/p/3a7537c75bfd</guid>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[instagram]]></category>
            <category><![CDATA[osint]]></category>
            <category><![CDATA[safety]]></category>
            <dc:creator><![CDATA[Sriram]]></dc:creator>
            <pubDate>Thu, 13 May 2021 16:30:46 GMT</pubDate>
            <atom:updated>2021-05-13T16:30:46.599Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>DISCLAIMER: This is not a “How to hack someone on Instagram” tutorial, but rather an awareness post on how people get scammed on the internet and how to protect yourself from getting hacked.</strong></p><p>Firstly, I would like to make something clear. If you intend to hack someone, you’ve come to the wrong place. It is absolutely contemptible to hack someone without their consent, and with the tools available it is highly unlikely that you would get away with it! I believe in the principles of transparency of data. The information that I post here is publicly available and anyone can access it. Unfortunately, getting hold of it and making it work is easier than you think, and I strongly believe that people should be aware of such scams. Having said all that, hope is not lost: there are straightforward measures to make sure you are safe from getting hacked.</p><p>If you are not into the technical details but just want to learn how to protect yourself from the attack, go to the end of the post.</p><p>The Script goes as follows:</p><p><strong>1. OSINTGRAM</strong></p><p>Well, to start things off, what is <em>OSINT?</em></p><p>OSINT, or Open Source Intelligence, is a multi-method methodology for collecting, analyzing and making decisions about data accessible in publicly available sources, to be used in an intelligence context. In simpler words, it is publicly available information that can be used for data analysis, data collection etc.</p><p>Osintgram is essentially a computer program that uses the Instagram <a href="https://en.wikipedia.org/wiki/API">API</a> to gather information. On paper, there is nothing illegal about it, and it’s beautifully written code (credits to the developers). The more I think about it, the more practical applications I see!</p><p><a href="https://github.com/Datalux/Osintgram.git">GitHub - Datalux/Osintgram: Osintgram is a OSINT tool on Instagram. 
It offers an interactive shell to perform analysis on Instagram account of any users by its nickname</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*itMzauro7xlMKlrJOFTdBQ.png" /><figcaption>The account is a dummy account created for educational purposes</figcaption></figure><p>Apart from flaunting the rather “typical hacker screen” terminal window, the developers have written simple yet efficient code in your favourite language, C++ (just kidding, it’s written in Python XD). Having gone through the code, I must say it is just simple Instagram API calls, with which you can gather the following information:</p><ol><li>All registered addressed by target photos</li><li>Target’s photos captions</li><li>A list of all the comments on the target’s posts</li><li>Total comments of target’s posts</li><li>Target followers</li><li>Users followed by target</li><li>Email of target followers</li><li>Email of users followed by target</li><li>Phone number of target followers</li><li>Phone number of users followed by target</li><li>Hashtags used by the target</li><li>Total likes of target’s posts</li><li>Target’s posts type (photo or video)</li><li>Description of target’s photos</li><li>Download target’s photos in the output folder</li><li>Download target’s profile picture</li><li>Download target’s stories</li><li>List of users tagged by target</li><li>A list of user who commented target’s photos</li><li>A list of user who tagged target</li></ol><p>The only hope is that all this information is accessible only <strong>if</strong> <strong>the account is public or the account of the victim is followed by the perpetrator. </strong>So as a general rule of thumb, do not accept follow requests from accounts you know nothing about. And as far as public accounts are concerned, this process is rather computationally intensive, and retrieving the information is impossible (at least for your everyday hacker who googled “how to hack someone on Instagram”). 
And the information about the account of the hacker will be gathered at Instagram’s end.</p><p>The main scope of this article is complete, but to demonstrate how a typical script plays out, I’ll continue with a few more steps.</p><p><strong>2. Blackeye</strong></p><p>Blackeye is yet another social engineering tool that is publicly available on the internet. It allows anyone to host a dummy version of a well-known website to capture information like the username and password. This is a far more powerful tool, but at the same time it can be easily detected. The website has to be hosted somewhere (mostly on temporary platforms like <a href="https://computernewb.com/wiki/Ngrok">ngrok</a>, serveo etc.).</p><p>So as a general rule of thumb, never open anonymous-looking links, especially ones with ngrok in the domain.</p><p>The original authors of the script have taken it down. That said, many modified versions of the OG Blackeye are still easily accessible.</p><p><strong>3. SET</strong></p><p>SET (Social Engineering Toolkit) is a popular tool usually packed with the default installation of Kali Linux (or any pen-testing distro for that matter). It is a Swiss Army knife for social engineering: essentially, it gives you a list of tools for basic social engineering. The tool was intended to let white-hat (ethical) hackers simulate an actual phishing mail.</p><p>So typically the perpetrator would create a dummy account and follow your Instagram account. He would then extract information from your account using Osintgram (say, the email ids of your followers). 
He would then send a string of spam emails to all your followers with a link to a dummy website hosted using Blackeye, and people who ignorantly log in through this link would compromise their credentials :(</p><p><strong>Is all hope lost?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/474/0*HgPrdmjeAY4VUpXN" /></figure><p>NO! This “scam”, like most of the prevailing scams, requires a lot of luck and a continuous series of careless moves by the victim. Many of these could be avoided with simple steps.</p><p><strong>1. Use a spam filter!</strong></p><p>All email services have spam filters. Here is a link to a step-by-step guide to adding spam filters:</p><p><a href="https://www.rightinbox.com/blog/gmail-spam-filter">https://www.rightinbox.com/blog/gmail-spam-filter</a></p><p><strong>2. Do not click on unknown or suspicious links.</strong></p><p>Most popular organisations host their links on their own servers, so it is highly likely that the domain name clearly contains the name of the organisation or website. If the URL doesn&#39;t explicitly show that, avoid clicking on the link. To be doubly sure, I recommend using <a href="https://www.virustotal.com/gui/home/url">Virus Total</a> to check whether the website is safe to visit.</p><p><strong>3. Avoid accepting follow requests from suspicious or unknown accounts.</strong></p><p>I must say it is quite difficult these days to make an anonymous account that doesn&#39;t get flagged almost instantly, and it is unlikely that the hacker would get away with it. Nevertheless, it is always better not to get hacked than to go through the whole process.</p><p><strong>4. Get yourself educated! 
Do not be technology ignorant!</strong></p><p>Do follow cybersecurity updates and the latest trends, or at least the most popular ones.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to setup a Pseudo-distributed Cluster with Hadoop 3.2.1 and Apache Spark 3.0]]></title>
            <link>https://msris108.medium.com/how-to-setup-a-pseudo-distributed-cluster-with-hadoop-3-2-1-and-apache-spark-3-0-34406a85130f?source=rss-bba4fc74f7a5------2</link>
            <guid isPermaLink="false">https://medium.com/p/34406a85130f</guid>
            <category><![CDATA[hadoop]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[pseudo-distributed-mode]]></category>
            <category><![CDATA[ubuntu]]></category>
            <dc:creator><![CDATA[Sriram]]></dc:creator>
            <pubDate>Fri, 14 Aug 2020 13:04:09 GMT</pubDate>
            <atom:updated>2020-08-14T13:04:09.902Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*b2yZjrhD_zDyKVtT-4yCQA.png" /><figcaption><a href="https://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/">https://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/</a></figcaption></figure><p>This post is an installation guide for Apache Hadoop 3.2.1 and Apache Spark 3.0 [latest stable versions]. It assumes that you have used Big Data frameworks like Hadoop and Apache Spark before and want to try out the latest versions of the Hadoop and Spark environments for <strong>development </strong>purposes. I will also cover the fundamentals of Apache Hadoop and Apache Spark.</p><p><strong>Note</strong>: This installation is not meant to be used in a real-life / production environment. My next post will cover the setup of a multi-node cluster for a production environment.</p><h3>What is the difference between Stand-Alone mode and pseudo-distributed mode?</h3><p><strong>Single Node (Local Mode or Standalone Mode)</strong> <br>Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, where you don’t really use HDFS. <br>In standalone mode, both input and output use the local file system.</p><p>You also don’t need to do any custom configuration in the files mapred-site.xml, core-site.xml and hdfs-site.xml.</p><p>Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all the input and output.</p><p><strong>Pseudo-distributed Mode</strong> <br>The pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the DataNode reside on the same machine.</p><p>In pseudo-distributed mode, all the Hadoop daemons will be running on a single node. 
Such a configuration is mainly used for testing, when we don’t need to think about resources and other users sharing them.</p><p>In this architecture, a separate JVM is spawned for every Hadoop component, and the components communicate across network sockets, effectively producing a fully functioning and optimized mini-cluster on a single host.</p><p>So, in this mode, changes are required in all three configuration files: mapred-site.xml, core-site.xml and hdfs-site.xml.</p><h3>HDFS and MapReduce:</h3><p><strong>Apache Hadoop </strong>is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional computer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. 
In short, HDFS gives us a base for storing large datasets distributed across multiple nodes, and the MapReduce programming model gives us a faster and more efficient way to process that data.</p><p><em>Useful Resources:</em></p><p><a href="https://hadoop.apache.org/docs/stable/">Apache Hadoop 3.2.1</a></p><p><a href="https://en.wikipedia.org/wiki/MapReduce">https://en.wikipedia.org/wiki/MapReduce</a></p><p><strong>The base Apache Hadoop framework is composed of the following modules:</strong></p><ul><li><em>Hadoop Common</em>: contains libraries and utilities needed by other Hadoop modules;</li><li><em>Hadoop Distributed File System (HDFS)</em>: a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;</li><li><em>Hadoop YARN</em>: a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;</li><li><em>Hadoop MapReduce</em>: an implementation of the MapReduce programming model for large-scale data processing.</li></ul><p>The term <em>Hadoop</em> is often used for both base modules and sub-modules and also the <em>ecosystem</em>, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as <a href="https://en.wikipedia.org/wiki/Pig_(programming_tool)">Apache Pig</a>, <a href="https://en.wikipedia.org/wiki/Apache_Hive">Apache Hive</a>, <a href="https://en.wikipedia.org/wiki/Apache_HBase">Apache HBase</a>, <a href="https://en.wikipedia.org/wiki/Apache_Phoenix">Apache Phoenix</a>, <a href="https://en.wikipedia.org/wiki/Apache_Spark">Apache Spark</a>, <a href="https://en.wikipedia.org/wiki/Apache_ZooKeeper">Apache ZooKeeper</a>, <a href="https://en.wikipedia.org/wiki/Cloudera_Impala">Cloudera Impala</a>, <a href="https://en.wikipedia.org/wiki/Apache_Flume">Apache Flume</a>, <a href="https://en.wikipedia.org/wiki/Apache_Sqoop">Apache Sqoop</a>, <a 
href="https://en.wikipedia.org/wiki/Apache_Oozie">Apache Oozie</a>, and <a href="https://en.wikipedia.org/wiki/Apache_Storm">Apache Storm</a>. In this post we shall install Apache Spark along with Hadoop.</p><h3>Installation of Hadoop:</h3><h4><strong>Pre-req:</strong></h4><ol><li>A Linux distribution (a VM should work fine, but it is not recommended)</li><li>Sudo privileges</li><li>A decent computer with a stable internet connection (only for downloading the necessary software)</li></ol><h4>Installation:</h4><ol><li>Install Java</li></ol><pre>sudo apt update<br>sudo apt install openjdk-8-jdk openjdk-8-jre<br># these commands are for an Ubuntu system</pre><p>2. See the <a href="http://wiki.apache.org/hadoop/HadoopJavaVersions">Hadoop Wiki</a> for known good versions. I used Java version 8. Verify your installation using java -version.</p><pre>(base) sriram@sriram-Inspiron-7572:~$ <strong>java -version<br>openjdk version &quot;1.8.0_265&quot;</strong><br>OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)<br>OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)</pre><p>3. To change the Java version used:</p><pre>(base) sriram@sriram-Inspiron-7572:~$ <strong>sudo update-alternatives --config java</strong><br>There are 2 choices for the alternative java (providing /usr/bin/java).</pre><pre>Selection    Path                                            Priority   Status<br>------------------------------------------------------------<br>  0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode<br>  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode<br>* 2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode</pre><pre>Press &lt;enter&gt; to keep the current choice[*], or type selection number:</pre><p>I have moved my java-8-openjdk-amd64 to /usr/local/ (personal preference). 
I suggest you do the same for the sake of this tutorial, or note down your own location to use for $JAVA_HOME.</p><p>4. Add JAVA_HOME to ~/.bashrc</p><p><strong>Note: </strong>~/.bashrc runs every time you open a shell, so a bad edit can break your terminal sessions. Edit it carefully and make sure you don’t delete or add unnecessary lines. Most tutorials (this one included) suggest the nano or vi text editors. If you come from a pure Windows background you may find them hard to use, in which case I recommend gedit or subl instead (just replace nano/vi with gedit).</p><pre><strong>$ nano ~/.bashrc</strong> #to open bashrc (no sudo needed for your own file)</pre><p>Scroll to the end and paste these lines:</p><pre># JAVA VARIABLES<br>export JAVA_HOME=/usr/local/java-8-openjdk-amd64 <br>export PATH=$PATH:$JAVA_HOME/bin</pre><p>Save and close (ctrl + s and ctrl + x for nano).</p><p>5. Install ssh. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management (on Ubuntu):</p><pre>(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install ssh<br>(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install pdsh</pre><p>Make sure you add this line to your ~/.bashrc file:</p><pre># this line is to ensure pdsh uses ssh<br>export PDSH_RCMD_TYPE=ssh</pre><p>6. 
Set up passphraseless ssh</p><p>Now check that you can ssh to the localhost without a passphrase:</p><pre>$ ssh localhost</pre><p>If you cannot ssh to localhost without a passphrase, execute the following commands:</p><pre>$ ssh-keygen -t rsa -P &#39;&#39; -f ~/.ssh/id_rsa<br>$ cat ~/.ssh/id_rsa.pub &gt;&gt; ~/.ssh/authorized_keys<br>$ chmod 0600 ~/.ssh/authorized_keys</pre><p>Once you are done, it should look like this:</p><pre>(base) sriram@sriram-Inspiron-7572:~$ ssh localhost<br>Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-42-generic x86_64)</pre><pre>* Documentation:  <a href="https://help.ubuntu.com">https://help.ubuntu.com</a><br> * Management:     <a href="https://landscape.canonical.com">https://landscape.canonical.com</a><br> * Support:        <a href="https://ubuntu.com/advantage">https://ubuntu.com/advantage</a></pre><pre>* Are you ready for Kubernetes 1.19? It&#39;s nearly here! Try RC3 with<br>   sudo snap install microk8s --channel=1.19/candidate --classic</pre><pre><a href="https://microk8s.io/">https://microk8s.io/</a> has docs and details.</pre><pre>2 updates can be installed immediately.<br>0 of these updates are security updates.<br>To see these additional updates run: apt list --upgradable</pre><pre>Your Hardware Enablement Stack (HWE) is supported until April 2025.<br>*** System restart required ***<br>Last login: Fri Aug 14 13:17:31 2020 from 127.0.0.1</pre><p>7. Download and extract the Hadoop 3.2.1 package to a location of your choice.</p><pre>$ wget <a href="https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz</a> #DOWNLOAD<br>$ tar xzf hadoop-3.2.1.tar.gz #EXTRACT<br>$ mv hadoop-3.2.1 hadoop #rename<br>$ sudo mv hadoop /usr/local/ #move (writing to /usr/local needs sudo)</pre><p>You can also download the archive manually from the link below, extract it, and place it in any location. 
I placed Hadoop at /usr/local/.</p><p><a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1-src.tar.gz">Apache Download Mirrors</a></p><p>8. Set Hadoop environment variables</p><p>Add these lines to your /etc/environment file:</p><pre>PATH=&quot;/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin&quot;<br>JAVA_HOME=&quot;/usr/lib/jvm/java-8-openjdk-amd64/jre&quot;</pre><p>Add these lines to your ~/.bashrc file:</p><pre># load /etc/environment, whose PATH includes the Hadoop bin and sbin directories, so the &quot;hadoop&quot; command works from anywhere in the system (multi-environment)<br>source /etc/environment</pre><pre>export HADOOP_HOME=/usr/local/hadoop <br>export HADOOP_MAPRED_HOME=$HADOOP_HOME <br>export HADOOP_COMMON_HOME=$HADOOP_HOME <br># point the JVM at the native Hadoop libraries in lib/native (this does not affect functionality but can improve performance); this is associated with the NativeCodeLoader WARN.<br>export HADOOP_OPTS=&quot;$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native&quot;<br>export HADOOP_HDFS_HOME=$HADOOP_HOME <br>export YARN_HOME=$HADOOP_HOME <br>export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native <br>export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin <br>export HADOOP_INSTALL=$HADOOP_HOME</pre><p>To update the variables, run <strong>source ~/.bashrc</strong></p><p>9. <strong>** Edit Config Files **</strong></p><p>This is the most important section of the tutorial. 
Follow the steps carefully.</p><ul><li>Add these lines inside the &lt;configuration&gt; tags of the following files, or</li><li>replace the existing &lt;configuration&gt; … &lt;/configuration&gt; tags entirely</li></ul><p>$HADOOP_HOME/etc/hadoop/core-site.xml:</p><pre>&lt;configuration&gt;<br>&lt;property&gt;<br>  &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;<br>  &lt;value&gt;/usr/local/hadoop/tmpdata&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;fs.default.name&lt;/name&gt;<br>  &lt;value&gt;hdfs://127.0.0.1:9000&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;/configuration&gt;</pre><p>$HADOOP_HOME/etc/hadoop/hdfs-site.xml:</p><pre>&lt;configuration&gt;<br>&lt;property&gt;<br>  &lt;name&gt;dfs.name.dir&lt;/name&gt;<br>  &lt;value&gt;/usr/local/hadoop/dfsdata/namenode&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;dfs.data.dir&lt;/name&gt;<br>  &lt;value&gt;/usr/local/hadoop/dfsdata/datanode&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;dfs.replication&lt;/name&gt;<br>  &lt;value&gt;1&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;/configuration&gt;</pre><p>$HADOOP_HOME/etc/hadoop/mapred-site.xml:</p><pre>&lt;configuration&gt; <br>&lt;property&gt; <br>  &lt;name&gt;mapreduce.framework.name&lt;/name&gt; <br>  &lt;value&gt;yarn&lt;/value&gt; <br>&lt;/property&gt; <br>&lt;/configuration&gt;</pre><p>$HADOOP_HOME/etc/hadoop/yarn-site.xml:</p><pre>&lt;configuration&gt;<br>&lt;property&gt;<br>  &lt;name&gt;yarn.nodemanager.aux-services&lt;/name&gt;<br>  &lt;value&gt;mapreduce_shuffle&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;yarn.nodemanager.aux-services.mapreduce.shuffle.class&lt;/name&gt;<br>  &lt;value&gt;org.apache.hadoop.mapred.ShuffleHandler&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;yarn.resourcemanager.hostname&lt;/name&gt;<br>  &lt;value&gt;127.0.0.1&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;yarn.acl.enable&lt;/name&gt;<br>  
&lt;value&gt;false&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;property&gt;<br>  &lt;name&gt;yarn.nodemanager.env-whitelist&lt;/name&gt;   <br>  &lt;value&gt;JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME&lt;/value&gt;<br>&lt;/property&gt;<br>&lt;/configuration&gt;</pre><p>10. Edit hadoop-env.sh</p><p>The <em>hadoop-env.sh</em> file serves as a master file to configure YARN, HDFS, <a href="https://phoenixnap.com/kb/hadoop-mapreduce">MapReduce</a>, and Hadoop-related project settings.</p><p>When setting up a <strong>single node Hadoop cluster</strong>, you need to define which Java implementation is to be utilized. Use the previously created <strong>$HADOOP_HOME</strong> variable to access the <em>hadoop-env.sh</em> file.</p><p>First note the value of $JAVA_HOME (running the variable as a command makes bash print the path in its error message; echo $JAVA_HOME works too):</p><pre>(base) sriram@sriram-Inspiron-7572:~$ $JAVA_HOME<br>bash: /usr/local/java-8-openjdk-amd64: Is a directory</pre><p>Here the value is /usr/local/java-8-openjdk-amd64. Now open the hadoop-env.sh file:</p><pre>sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh</pre><p>Uncomment the <strong>$JAVA_HOME</strong> variable (i.e., remove the <strong># </strong>sign) and add the full path to the OpenJDK installation on your system. If you have installed the same version as presented in the first part of this tutorial, add the following line (with no space after the = sign):</p><pre>export JAVA_HOME=/usr/local/java-8-openjdk-amd64</pre><p>11. Format the file system</p><pre>$ $HADOOP_HOME/bin/hdfs namenode -format</pre><p>12. 
If everything has gone well so far, you should see the output below, which means you have successfully installed Hadoop on a single node.</p><pre>(base) sriram@sriram-Inspiron-7572:~$ <strong>hadoop version</strong><br>Hadoop 3.2.1<br>Source code repository <a href="https://gitbox.apache.org/repos/asf/hadoop.git">https://gitbox.apache.org/repos/asf/hadoop.git</a> -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842<br>Compiled by rohithsharmaks on 2019-09-10T15:56Z<br>Compiled with protoc 2.5.0<br>From source with checksum 776eaf9eee9c0ffc370bcbc1888737<br>This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar</pre><p>13. Verify the installation</p><pre><strong>(base) sriram@sriram-Inspiron-7572:~$ start-all.sh</strong><br>WARNING: Attempting to start all Apache Hadoop daemons as sriram in 10 seconds.<br>WARNING: This is not a recommended production deployment configuration.<br>WARNING: Use CTRL-C to abort.<br>Starting namenodes on [localhost]<br>Starting datanodes<br>localhost: datanode is running as process 33621.  Stop it first.<br>Starting secondary namenodes [sriram-Inspiron-7572]<br>sriram-Inspiron-7572: secondarynamenode is running as process 33832.  
Stop it first.<br>Starting resourcemanager<br>Starting nodemanagers<br><strong>(base) sriram@sriram-Inspiron-7572:~$ jps</strong><br>35475 Jps<br>33621 DataNode<br>35111 NodeManager<br>33832 SecondaryNameNode<br>34954 ResourceManager</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dquaPbYy1f-S6Kjkfv-I_A.png" /><figcaption>Name Node</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XbbzNgbGe-FW5TI9A9wUsw.png" /><figcaption>Data Node</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5YFVAOR5sS_bgiD1tzUY6A.png" /><figcaption>Resource Manager</figcaption></figure><p>PORTS: [localhost]</p><p>8088: Resource Manager</p><p>9870: Name Node</p><p>9864: Data Node</p><h4>My bashrc file:</h4><pre># load /etc/environment, whose PATH includes the Hadoop bin and sbin directories, so the &quot;hadoop&quot; command works from anywhere in the system<br>source /etc/environment</pre><pre># JAVA VARIABLES<br>export JAVA_HOME=/usr/local/java-8-openjdk-amd64 <br>export PATH=$PATH:$JAVA_HOME/bin</pre><pre># HADOOP VARIABLES<br>export HADOOP_HOME=/usr/local/hadoop <br>export HADOOP_MAPRED_HOME=$HADOOP_HOME <br>export HADOOP_COMMON_HOME=$HADOOP_HOME <br># point the JVM at the native Hadoop libraries in lib/native (this does not affect functionality but can improve performance); this is associated with the NativeCodeLoader WARN.<br>export HADOOP_OPTS=&quot;$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native&quot;<br>export HADOOP_HDFS_HOME=$HADOOP_HOME <br>export YARN_HOME=$HADOOP_HOME <br>export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native <br>export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin <br>export HADOOP_INSTALL=$HADOOP_HOME</pre><pre># this line is to ensure pdsh uses ssh<br>export PDSH_RCMD_TYPE=ssh</pre><pre># SPARK VARIABLES<br>export SPARK_HOME=/usr/local/spark<br>export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin</pre><p>14. 
Test a Basic Command</p><p><a href="https://www.geeksforgeeks.org/hdfs-commands/">HDFS Commands - GeeksforGeeks</a></p><p>// Guess what the code does ? // (answer at the end)</p><pre>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>bin/hdfs dfs -mkdir /user</strong><br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>bin/hdfs dfs -mkdir /user/sriram</strong><br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>hdfs dfs -mkdir /input</strong><br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>hdfs dfs -put etc/hadoop/*.xml /input</strong><br>2020-08-14 15:16:02,263 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:03,116 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:03,300 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:03,759 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:03,931 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:04,104 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:04,288 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:04,405 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:16:04,524 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>bin/hadoop jar 
share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output &#39;dfs[a-z.]+&#39;</strong><br>2020-08-14 15:18:41,134 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032<br>2020-08-14 15:18:41,853 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001<br>2020-08-14 15:18:42,045 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false<br>2020-08-14 15:18:42,277 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001<br>org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:9000/user/sriram/input<br> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)<br> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)<br> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:396)<br> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)<br> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)<br> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)<br> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)<br> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)<br> at java.security.AccessController.doPrivileged(Native Method)<br> at javax.security.auth.Subject.doAs(Subject.java:422)<br> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)<br> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)<br> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)<br> at org.apache.hadoop.examples.Grep.run(Grep.java:78)<br> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)<br> at 
org.apache.hadoop.examples.Grep.main(Grep.java:103)<br> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)<br> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br> at java.lang.reflect.Method.invoke(Method.java:498)<br> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)<br> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)<br> at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)<br> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)<br> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br> at java.lang.reflect.Method.invoke(Method.java:498)<br> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)<br> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)<br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ <strong>bin/hdfs dfs -cat output/*</strong><br>cat: `<strong>output/part-r-00000</strong>&#39;: No such file or directory<br>cat: `<strong>output/_SUCCESS</strong>&#39;: No such file or directory<br>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DXUqi0PLrQ9KBUJnVzRIXA.png" /><figcaption>The changes reflected in the HDFS</figcaption></figure><pre>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ cat output/part-r-00000 <br>1 dfsadmin<br>1 dfs.replication</pre><p>Code: [answer]</p><ol><li>Created the directories /user, /user/sriram and /input on HDFS</li><li>hdfs dfs -put etc/hadoop/*.xml /input: copies all the .xml config files into /input</li><li>Ran the example grep job, which counts every string matching the regular expression dfs[a-z.]+ in the input files and writes the counts to output</li></ol><p>Note that the grep job in the log above fails with InvalidInputException: the relative path input resolves to /user/sriram/input on HDFS, while the files were uploaded to /input. Passing the absolute path /input as the job’s input avoids this.</p><p>15. 
Finishing things off</p><pre>(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ stop-all.sh<br>WARNING: Stopping all Apache Hadoop daemons as sriram in 10 seconds.<br>WARNING: Use CTRL-C to abort.<br>Stopping namenodes on [localhost]<br>Stopping datanodes<br>Stopping secondary namenodes [sriram-Inspiron-7572]<br>Stopping nodemanagers<br>Stopping resourcemanager</pre><h3>Apache Spark</h3><p>Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads — batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You’ll find it used by organizations in any industry, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing frameworks, with 365,000 meetup members in 2017.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/192/0*2_7on45thZAolxv0.png" /><figcaption>logo</figcaption></figure><h3>Apache Spark vs. Apache Hadoop</h3><p>Outside of the differences in the design of Spark and Hadoop MapReduce, many organizations have found these big data frameworks to be complementary, using them together to solve a broader business challenge.</p><p>Hadoop is an open source framework that has the Hadoop Distributed File System (HDFS) as storage, YARN as a way of managing computing resources used by different applications, and an implementation of the MapReduce programming model as an execution engine. In a typical Hadoop implementation, different execution engines are also deployed such as Spark, Tez, and Presto.</p><p>Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. 
It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like <a href="https://aws.amazon.com/redshift/">Amazon Redshift</a>, <a href="https://aws.amazon.com/s3/">Amazon S3</a>, Couchbase, Cassandra, and others. Spark on Hadoop leverages YARN to share the same cluster and dataset as other Hadoop engines, ensuring consistent levels of service and response.</p><p>In this post I will not dive deep into the Spark framework, but will give a quick installation guide.</p><p><strong>Installation:</strong></p><p><a href="https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz">Apache Download Mirrors</a></p><ol><li>Download the archive from the above link, extract it, and place it at /usr/local/spark</li><li>Add the following lines to ~/.bashrc (change the location if you extracted it somewhere else)</li></ol><pre># Spark Variables<br>export SPARK_HOME=/usr/local/spark<br>export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin</pre><p>3. Add the following line to $SPARK_HOME/bin/load-spark-env.sh</p><pre>export SPARK_LOCAL_IP=&quot;127.0.0.1&quot;</pre><p>4. Verify the installation</p><pre>start-all.sh # To start all hadoop-daemons<br>spark-shell --master yarn # start spark with YARN</pre><pre>2020-08-14 17:53:50,165 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable<br>Setting default log level to &quot;WARN&quot;.<br>To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).<br>2020-08-14 17:54:00,660 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.<br>Spark context Web UI available at <a href="http://localhost:4040">http://localhost:4040</a><br>Spark context available as &#39;sc&#39; (master = yarn, app id = application_1597405003831_0005).<br>Spark session available as &#39;spark&#39;.<br>Welcome to<br>      ____              __<br>     / __/__  ___ _____/ /__<br>    _\ \/ _ \/ _ `/ __/  &#39;_/<br>   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0<br>      /_/<br>         <br>Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)<br>Type in expressions to have them evaluated.<br>Type :help for more information.</pre><pre>scala&gt;</pre><p><a href="https://github.com/msris108/BIG_DATA-PROJECTS">msris108/BIG_DATA-PROJECTS</a></p><p>Check out my GitHub repo, which covers the basics of Spark and SparkML. More articles on Spark and SparkML will be posted soon.</p>]]></content:encoded>
        </item>
    </channel>
</rss>