<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Justin Gosses on Medium]]></title>
        <description><![CDATA[Stories by Justin Gosses on Medium]]></description>
        <link>https://medium.com/@justingosses?source=rss-64df3cb11ba4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*YD7j3CKEbnBHtpYaBkottw.jpeg</url>
            <title>Stories by Justin Gosses on Medium</title>
            <link>https://medium.com/@justingosses?source=rss-64df3cb11ba4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 06 Jun 2026 21:27:02 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@justingosses/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Beyond Awesome Lists]]></title>
            <link>https://justingosses.medium.com/beyond-awesome-lists-3ccb074f7859?source=rss-64df3cb11ba4------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ccb074f7859</guid>
            <category><![CDATA[catalog]]></category>
            <category><![CDATA[d3js]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[geoscience]]></category>
            <category><![CDATA[awesome-list]]></category>
            <dc:creator><![CDATA[Justin Gosses]]></dc:creator>
            <pubDate>Mon, 31 May 2021 20:54:05 GMT</pubDate>
            <atom:updated>2021-05-31T22:10:23.365Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Awesome lists are… awesome. But could they be even more useful? What if instead of just a curated list you also got a view into a wider community?</em></p><h3>What is an Awesome list?</h3><p>An Awesome List is a community-curated list of code projects within a specific domain, application, or use case. You can read more in the “<a href="https://github.com/sindresorhus/awesome/blob/main/awesome.md">Awesome Manifesto</a>” on <a href="https://github.com/sindresorhus">sindresorhus</a>/<a href="https://github.com/sindresorhus/awesome">awesome</a>.</p><p>They are great places to first look for open source code projects that others have found highly useful in your problem space, be it markdown editors, JavaScript data visualization, Jupyter notebook widgets, GraphQL, spatial analysis, robotics, or <a href="https://project-awesome.org/phillipadsmith/awesome-github">hundreds</a> of other things. Other ways people might discover code are word of mouth, Google searches, trawling GitHub, seeing the packages used in other code, or having it recommended by a mentor or peer.</p><blockquote><em>Searching for useful code projects is an investment of time. Anything that shortens that time, and enables a developer to reuse code instead of starting from scratch, offers big rewards in terms of time savings.</em></blockquote><h3>What Makes Awesome Lists so Successful?</h3><p>This is mostly guesswork on my part, but I’d say the success of the “Awesome List” form is some combination of the following characteristics:</p><ol><li><strong>Easy to start</strong> (<em>Start a repository with a single README.md file</em>)</li><li><strong>Easy to contribute</strong> to (<em>Edit a single line in a markdown file</em>)</li><li><strong>Fills a need</strong> (<em>Every developer searches for new code tools</em>)</li><li><strong>Builds a community</strong> (<em>They have a very large number of contributors. 
People come back to check the list from time to time. People send links to it to others.</em>)</li></ol><h3>What do Awesome Lists Not do Well?</h3><p>As a curated list, they provide a signal of potential value. To be included on an Awesome List, someone has to say a project is “Awesome”.</p><p>However, there are a variety of other types of information that would be useful to know when evaluating what code project to use or what project to contribute to that Awesome Lists don’t provide. They include:</p><ol><li>How popular is each code project?</li><li>Is the rate at which this new code project is growing in popularity unusual relative to similar code projects in the past?</li><li>What code projects are used as a dependency by other code projects within the domain space, and therefore might be worth learning about?</li><li>What code projects share dependencies and probably do related things?</li><li>What code projects share contributors?</li><li>What organizations own the most code projects in that domain/problem/solution space?</li></ol><blockquote><em>The interesting thing is, the information to get at all these questions exists; it just doesn’t exist in an easy-to-access form most of the time.</em></blockquote><p>All of the questions above can be answered with metadata extracted via GitHub’s or GitLab’s API. 
Even the dependency information, which was hard to get at several years ago, is now extracted by code platforms as a service and available from their APIs.</p><h3>Lawrence Livermore National Laboratory’s Software Catalog’s Explore Pages</h3><p>A great example of harvesting this type of metadata using GitHub’s API and turning it into insightful visualizations is the <a href="https://software.llnl.gov/">explore</a> section of Lawrence Livermore National Laboratory’s (LLNL) Software Catalog.</p><p><strong>Wanting to understand similar relationships for subsurface geoscience code, I decided to adapt LLNL’s project such that instead of visualizing Lawrence Livermore National Laboratory’s code catalog, it was visualizing the code curated in </strong><a href="https://softwareunderground.org/"><strong>Software Underground</strong></a><strong>’s </strong><a href="https://github.com/softwareunderground/awesome-open-geoscience"><strong>Awesome-Open-Geoscience</strong></a><strong> awesome list.</strong></p><p>The Awesome-Open-Geoscience awesome list was started by myself and several other members of <a href="https://softwareunderground.org/">Software Underground</a>, or SWUNG, in October of 2017. As of May 2021, it has 127 watchers, 720 stars, and 51 different contributors. It is the most popular repository if you search “geoscience” on GitHub.com.</p><p>Adapting <a href="https://github.com/LLNL/llnl.github.io">llnl.github.io</a> to make <a href="https://github.com/softwareunderground/open_geosciene_code_projects_viz">open_geosciene_code_projects_viz</a> wasn’t as easy as cloning it and re-running a script, as the repository is less a tool and more a product. I had to remove lots of LLNL-specific content, edit file paths, and replace bits of HTML, CSS, and JavaScript throughout the project. 
There’s still some work to get it to a place where anyone else with a list of repositories they want to understand could clone the repository, change some lines in a configuration file, re-run a few scripts, and deploy it as a GitHub Pages site, but it’s getting there. Switching it from LLNL to SWUNG took me several days of work. Switching it from SWUNG to something else took me less than a day.</p><blockquote><em>Ideally, it would be great if it only took a couple of hours of work to repopulate all the pages and visualizations for another Awesome list.</em></blockquote><p><a href="https://softwareunderground.github.io/open_geosciene_code_projects_viz/">SWUNG Software Portal</a></p><h3>Images from the Explore Section of the Website</h3><p><strong><em>Beyond “curated list” and into “community level view”</em></strong></p><p>Most of these visualizations are interactive on the actual <a href="https://softwareunderground.github.io/open_geosciene_code_projects_viz/">website</a>. They allow many questions to be answered that provide a community-level view of the code projects curated on the Awesome list.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xVfscgVd6JAD0CxBH5qthQ.png" /></figure><p><em>How often does a new code project in this space appear? 
How old are the older ones?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QXFGLCuQJZ6pga0xdTYxDg.png" /></figure><p><em>How fast are people starring this new project versus older projects?</em></p><p><em>Has the number of new stars flatlined as people migrate to a new project?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zCEUmiX7g0mj3n1lTaDkYg.png" /></figure><p><em>What are the most common languages within this Awesome list?</em></p><p><em>What topics are the most common across all projects?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YPCty75ToDPxLsaYJ26WDA.png" /></figure><p><em>What organization or developer owns the most projects on the Awesome list?</em></p><p><em>What projects only have contributions from the code owners vs. contributions from many external parties and therefore might be better places to submit a pull request?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QNvAhlS7Qt-B6GI8rZUfgQ.png" /></figure><p><em>Has activity jumped around a conference or other event?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vT4Eb42FWC9tuNEvZGK02Q.png" /></figure><p><em>What projects on the Awesome list are used as a dependency by other projects on the Awesome list?</em></p><p><em>Where to make contributions such that your code is reused by the highest number of projects?</em></p><p><em>What projects are built entirely differently from the norm?</em></p><p><em>Which projects are you more likely to be able to jump into and contribute to because their dependencies are already very similar to your existing work?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0Y8QWMLUvyITJfE0qHFyQQ.png" /></figure><p><em>Picking a license can be hard. Is there a pattern in terms of what others in your community have gone with? 
Can you use this to help guide your choice?</em></p><h3>Additional Questions that Could be Answered that Would “Nudge” Developers</h3><p>As you might have already concluded from the questions that the visualizations above help answer, visualizing this type of information can nudge developers in ways that help direct development activity. This means visualizing the community of open source subsurface geoscience code has the potential to change how it develops in small ways. Thinking about it in this framing also helps suggest what future visualizations might have value.</p><p><strong>Connection that could be made visible… </strong>What repositories share dependencies and therefore do somewhat similar stuff? <strong>…which could nudge </strong>developers to understand code as related groups and not hundreds of unrelated entities.</p><p><strong>Connection that could be made visible… </strong>What code projects are used by projects I care about? <strong>…which could nudge </strong>developers to contribute to repositories that more people/projects depend on.</p><p><strong>Connection that could be made visible… </strong>Do you and someone else contribute to the same repos? Is there a group of people that tend to contribute to the same repos? <strong>…which could nudge </strong>developers to reach out to specific people, forming community.</p><h3>The tech stack to get “Community Level View” from “Curated List”</h3><p><em>So is this code at a place where anyone with an Awesome List, software catalog, or other type of list of repositories could easily re-run and deploy their own version?</em></p><p><em>Unfortunately, </em><strong><em>not yet.</em></strong></p><p>As noted above, converting the existing LLNL software catalog into <a href="https://github.com/softwareunderground/open_geosciene_code_projects_viz">open_geoscience_code_projects_visualization</a> took a while. 
It was started during a <a href="https://softwareunderground.org/events/transform-2021">hackathon</a> but probably took 2–3 days of full-time work. The current version is easier to adapt to another list, but it would probably take at least a full day for someone unfamiliar with the code base. Additionally, repositories aren’t being pulled directly from the Awesome List programmatically, so any additions to the Awesome List currently need to be added manually.</p><p>We can talk about possible futures, though.</p><blockquote><em>In an ideal world, it would be easy and quick to take an existing Awesome List and create all these visualizations for the repositories categorized on that list. The visualizations would also stay up to date with the Awesome List.</em></blockquote><p>Future Requirements:</p><ol><li>An Awesome List scraper to pull out GitHub repository links. [<em>No work done on this yet</em>]</li><li>A GitHub Actions script to do number 1 above and populate the input_list.json file of all the repositories to be visualized. This could be scheduled to check for updates on some interval. [<em>No work done on this yet</em>]</li><li>Documentation for how to easily add new or update existing visualizations in the explore section pages, such that changes can be easily integrated into already deployed forks. [<em>No work done on this yet</em>]</li><li>Remove more of the list-specific content in HTML and Markdown files and have it instead be populated programmatically from the key:value pairs in the _config.yml file in order to get the initial deployment down to a couple of hours from several days. [<em>This is 50% complete.</em>]</li></ol><p>More thoughts on this can be found in <a href="https://github.com/softwareunderground/open_geosciene_code_projects_viz">this</a> markdown file.</p><h3>How You Can Contribute</h3><p>This post shared a half-formed idea and working prototype, not a polished, easily reusable product. 
<strong><em>If the idea of an Awesome List add-on that creates visualizations of an open source community interests you</em></strong> and you’d like to contribute, check out <a href="https://github.com/LLNL/llnl.github.io/issues/481">this issue on the original LLNL repository</a> or <a href="https://github.com/softwareunderground/open_geosciene_code_projects_viz/blob/main/changes_needed.md">this markdown file on current and future changes in my repository</a>. You might also appreciate <a href="https://observablehq.com/@justingosses/more-visible-connections-between-projects-can-nudge-devel">this slide</a> pack on Observablehq.com. It talks about some of the potential benefits from a slightly different angle.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Alternatives to Iris: Finding Drop-In Replacements for Overused Example Datasets]]></title>
            <link>https://justingosses.medium.com/alternatives-to-iris-finding-drop-in-replacements-for-overused-example-datasets-ecea03b4ad00?source=rss-64df3cb11ba4------2</link>
            <guid isPermaLink="false">https://medium.com/p/ecea03b4ad00</guid>
            <category><![CDATA[data-visualisation]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[new-datasets]]></category>
            <dc:creator><![CDATA[Justin Gosses]]></dc:creator>
            <pubDate>Wed, 19 Aug 2020 03:01:21 GMT</pubDate>
            <atom:updated>2020-08-19T03:01:21.615Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Are you a little tired of seeing the Iris dataset being used in so many code packages and tutorials? Me too. What follows is an exploration of why the Iris dataset is so common as an example dataset, what features we might want to replicate in drop-in replacements for it, how we might find (or make) such a replacement, and some options for sharing such a replacement with others.</p><figure><img alt="photo of an Iris. Specifically, Iris sibirica" src="https://cdn-images-1.medium.com/max/1024/1*inMds3R8uXwdEy5ItPfO_Q.jpeg" /><figcaption>By Diliff - Own work, CC BY-SA 3.0, <a href="https://commons.wikimedia.org/w/index.php?curid=33037509">https://commons.wikimedia.org/w/index.php?curid=33037509</a></figcaption></figure><h3>What Do We Mean By Example Datasets?</h3><p>Example datasets are datasets packaged with a software application or code library, used in a tutorial for how to do something, or compiled in lists of “good starter datasets”. They are often used when the actual content represented by the dataset is secondary to just having one that is easy to work with. Example datasets are reused extremely heavily, to the point where some of the “standards” appear in thousands of places.</p><blockquote><a href="http://archive.ics.uci.edu/ml/datasets/Iris">The Iris dataset</a> is one of the most used example datasets. It represents measurements of parts of a flower structure for three species of Iris. 
Wikipedia has <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">a good description</a> of the dataset.</blockquote><h3>What Makes for a Good Example Dataset?</h3><ul><li>Column names are immediately understandable by everyone.</li><li>Data content revolves around a question almost anyone could relate to.</li><li>No nulls or other complications that require data preparation.</li><li>Data type, range, and distribution make the tasks you’re using the example dataset for possible.</li></ul><p>Basically, a good example dataset should require as close to nothing on the part of the end-user as possible in either operations or thought.</p><h3>Why Might We Want Alternatives to the Few Highly Reused Example Datasets?</h3><h4>The Dataset Isn’t Great For Showing ____ Technique</h4><p>Sometimes an example dataset just isn’t great at showing a particular technique or method. It can be done, but it might be a poor example of that specific thing. For example, the Iris dataset doesn’t have columns with properties that allow for features to be created from the raw data and used in machine learning. It is only a good example for doing operations based on the original columns. This is explained a bit more in <a href="https://armchairecology.blog/iris-dataset/">this blog post</a> by another author.</p><h4>Problematic History</h4><p>Certain datasets can also have problematic histories. The Iris dataset is a good example of this. It was originally published in the Annals of Eugenics by <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">R. A. Fisher</a>, who had racist views on genetics.</p><h4>Boring</h4><p>Users seeing the same examples again and again can lead to boredom. Some users wish they had a different dataset to use out of boredom. 
Others wish there was another dataset available for comparison purposes.</p><h4>Example Datasets as a Way to Increase Engagement in a Subject/Problem/Organization</h4><p>Example datasets can also be a way to bring eyeballs and brains to data on a particular topic. Providing a good example dataset could offer benefits to the dataset supplier.</p><blockquote><em>Example datasets can be a way to help people engage with a topic, subject matter, or cause.</em></blockquote><p>For open-data sites that provide a catalog of open-data from a city, state, governmental agency, or non-profit, being able to promote a few datasets as potential drop-in example dataset replacements could be a way to draw users to their content.</p><p>There aren’t many good examples of this, but there are examples of datasets that people have taken, sometimes repackaged into easier-to-use forms, and reused in many different side projects and tutorials. <a href="https://www.metoffice.gov.uk/hadobs/hadcrut4/">HadCRUT4</a> is an example of a climate dataset that has been widely used by end users of a wide variety of skill levels. You might be familiar with it as <a href="https://observablehq.com/@fkohlgrueber/warming-stripes">climate stripes</a>. It and NASA’s GISS Surface Temperature Analysis (<a href="https://data.giss.nasa.gov/gistemp/">GISTEMP v4</a>) dataset have been used as example datasets for multiple climate data visualization challenges and <a href="https://observablehq.com/@mbostock/global-temperature-trends">side projects</a>.</p><h3>Suggested Alternatives to Iris by Others</h3><p>If you google “alternatives to Iris dataset”, a variety of things pop up. Some of them are lists of alternatives. 
Here are two lists:</p><ul><li>4 alternatives to Iris: <a href="https://www.meganstodel.com/posts/no-to-iris/">https://www.meganstodel.com/posts/no-to-iris/</a></li><li>10 datasets including Iris for ML: <a href="https://machinelearningmastery.com/standard-machine-learning-datasets/">https://machinelearningmastery.com/standard-machine-learning-datasets/</a></li></ul><h4>The Penguin Dataset (A drop-in replacement for Iris)</h4><p>There is also a “penguin dataset” that has been put forth by several authors as “the” replacement for Iris as it replicates many of the traits of the original Iris dataset. The dataset is available in a <a href="https://github.com/allisonhorst/palmerpenguins">GitHub repository</a>, as a <a href="https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris">Kaggle dataset</a> with explanation, and as a <a href="https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95">tutorial on “Towards Data Science”</a>, a Medium channel, and has been used in many tutorials for methods or code packages, such as <a href="https://towardsdatascience.com/how-to-build-a-data-science-web-app-in-python-penguin-classifier-2f101ac389f3">this Streamlit example</a>. The original creator of the dataset, Allison Horst, has <a href="https://allisonhorst.github.io/palmerpenguins/">a nice GitHub Pages webpage</a> that goes through use of the dataset for data exploration and visualization. The page contains visuals that do a good job of showing how closely not only the data structure but also the distribution of the classes matches Iris.</p><h3>Why is the Iris Dataset so Popular?</h3><p><em>It is hard to definitively “know” this. What follows are guesses.</em></p><h4>History</h4><p>The Iris dataset was first published in 1936. Its author published several important works in biostatistics. I first became familiar with the dataset not while writing code but in my high school biology class while being introduced to biostatistics. 
It is small, easy to understand, and has been around for a long time.</p><h4>Mindshare</h4><p>In addition to the Iris dataset’s history that predates widespread data analysis and data visualization with code, there’s its more recent history as an example dataset in all the places people find example datasets. Although there are many data archives out there, few of them specialize in example datasets. The <a href="https://archive.ics.uci.edu/ml/datasets.php">University of California Irvine Machine Learning Repository</a> was started in 1987 by David Aha and others. It is one of the options for datasets you’ll get in a Google search and probably the oldest of the results. It has Iris as one of its example dataset options. Scikit-learn, TensorFlow, and the R language all have the Iris dataset built in. Tableau, Kaggle, and other online tools also feature the Iris dataset.</p><blockquote>When people need a small dataset with numerical columns and categorical labels without any nulls and some overlap between the classes, they think of the Iris Dataset because they’ve already seen it so much.</blockquote><h4>Dataset Attributes (how the information is represented as data)</h4><p>One of the main advantages of the Iris Dataset is that it is simple. There are no nulls. The data is all numerical. The presence of strings or categorical columns would make a dataset slightly harder to work with, and, in some applications, require converting the categories to numerical fields. The fact that the user doesn’t have to deal with this, missing values, or any other complication makes it easy to work with as an example dataset.</p><p>Additionally, the three classes of Iris overlap but not completely. This is a useful characteristic as it makes it amenable to tasks that involve prediction of classes, uncertainty analysis, and visualization. If the classes were extremely different, the dataset wouldn’t show uncertainty as well, either numerically or visually. 
At the same time, if the three classes of Iris overlapped perfectly, applying prediction to it would feel like a waste.</p><blockquote>Part of what makes Iris a good example dataset is that the distribution of its data works for a variety of tasks.</blockquote><h4>Dataset Content (what the data represents)</h4><blockquote>To be a good example dataset, the data has to represent something people can easily and quickly relate to.</blockquote><p>The Iris dataset revolves around the question of “what type of flower is this”, which is a simple, common question that pretty much everyone has thought at one time or another.</p><h3>What Data Attributes of Iris Might We Want to Mimic in a Drop-In Replacement?</h3><h4>Class names:</h4><p>The names may be in Latin but there are only three classes. Even if the end user doesn’t have a lot of familiarity with different irises, they likely understand the concept that there are different types of them. The class names for any drop-in Iris replacement shouldn’t require any extra work by the end user to understand the dataset.</p><h4>Data Structure:</h4><p>A flat data structure is generally easier for people to work with in the widest variety of tools. This means there is no nesting. There are only columns and rows. Each instance of an iris is a new row. All rows have the same number of columns.</p><h4>File format:</h4><p>The Iris dataset is stored in a variety of formats in different places by different parties. If one was storing a new dataset, CSV is likely the easiest file format to open, making the data usable by the largest number of people. If it can be stored in CSV, it can be stored in anything more complicated.</p><h4>Number of Columns:</h4><p>The Iris dataset has 4 data columns. Datasets that have 1 or 2 data columns probably rule out some tasks. Datasets with 200 columns require too much investigation by the end user. 
Although there are probably exceptions to this, we’re probably looking for a dataset with 3–10 data columns in addition to the classes column.</p><h4>Data Type of Columns:</h4><p>Numbers are the easiest data type for most people to use. Categorical strings are next easiest. Nulls require additional actions by the end-user. Arrays and dictionaries nested inside columns again require additional actions by the end-user and might be out of reach of some users’ technical abilities.</p><h3>What Data Content Characteristics Might We Want to Mimic in a Drop-In Replacement?</h3><p>These characteristics are harder to define. They have to do with whether people can understand the dataset without additional information and whether they can relate to the question it poses.</p><p>As an example, let’s imagine an example dataset with class labels “A”, “B”, “C” and feature columns with names “something”, “somethingElse”, and “otherThing”. The data columns contain floats with a distribution identical to Iris. Would that make a good example dataset? Probably not. It isn’t relatable.</p><h4>Column Names:</h4><p>Ideally, column names should be understandable to anyone. Users shouldn’t have to read material about the dataset. Ironically, it could be argued the Iris dataset fails this, as not that many people could define a ‘sepal’, the length and width of which make up two columns in the Iris dataset. Wikipedia has a nice definition of sepal with pictures <a href="https://en.wikipedia.org/wiki/Sepal">here</a>.</p><h4>What Question Does the Dataset Answer:</h4><p>A dataset that answers an obvious question is preferable to one that does not. An abstract way to phrase the question the Iris dataset helps answer is “There are several types of ___, and this data can be used to distinguish each type”. That phrasing gives us some potential guidance on where to look for drop-in replacements that answer similar questions. 
There are many other things that have classes of ___ and multiple numerical characteristics that describe instances of each type whose data distributions only partially separate.</p><blockquote>Animals, plants, minerals, rock types, planetary bodies, cars, book genres, and movie genres are all potential places to look for drop-in replacement example datasets.</blockquote><h3>Can We Find Alternatives Programmatically?</h3><p>Downloading many datasets one after another to examine their characteristics by hand does not sound like a fun experience. Doing the same task programmatically might speed things up a bit.</p><h4>Data Characteristics That Could Be Determined Using Code</h4><ul><li>File format</li><li>Data structure (flat or nested)</li><li>Number of columns</li><li>Number of rows</li><li>Number of labels</li><li>Data types of each column</li><li>Level of overlap between the data for each label class</li></ul><h4>Difficulties Programmatically Profiling Datasets in Bulk</h4><p>Although code exists to load CSV and JSON files and determine the features above, getting the many files programmatically in the first place is often the bigger challenge. Many large data catalogs don’t hold the datasets themselves, only the metadata. Often this metadata lacks a direct link to the files and instead has a link back to the original data catalog’s landing page. Programmatically getting datasets from a data catalog that doesn’t hold the data files itself can get very complicated quickly. Too many data catalogs assume a human will always be in the loop somewhere in the process. If data profiling works, it is often because you are working with a data catalog that holds the data itself and the metadata contains both direct download links and file format information.</p><h4>When Searching for Potential Alternative Datasets, What Likely Requires a Human?</h4><p>Certain dataset characteristics are hard to get at programmatically. 
What question a dataset seeks to answer and whether that question is one a large percentage of the user population will be able to relate to easily is difficult to get programmatically from the data and metadata alone. Whether column names are understandable at first glance is another question that is hard to answer programmatically.</p><h3>Where to Find Potential Example Datasets?</h3><p>When searching for open data that can be reused as example datasets, you can start searching in very large meta catalogs that ingest smaller catalogs or you can start searching in data catalogs focused on a single topic that hold the data themselves. The former is better for random discovery but the filtering power is often poor. The latter option often has better filtering capabilities, and you’re more likely to be able to filter based on file format, number of columns, size of file, etc.</p><p><a href="http://data.gov"><strong>Data.gov</strong></a> is a collection of data catalogs from various federal agencies. It is extremely large. Unfortunately, most of the datasets lack good metadata on file format or file size. They also frequently don’t have direct download links to files, requiring a human to click through a few pages to get to the actual dataset.</p><p>There are also datasets available for specific categories of things. For example, this is a <a href="http://webmineral.com/"><strong>mineral</strong></a><strong> database</strong>. Although it is very large, it is possible some small subset of it could be used as a replacement for Iris. In fact, you might be able to create a program that could generate different combinations of subsets of this dataset for use as example datasets. 
There are likely other data catalogs focused on large categories of ___ that might be good searching grounds for Iris replacements.</p><h3>Instead of Finding Datasets that would Make Good Example Datasets, Could We Create One?</h3><p>From a code standpoint, it should be very easy to create a small CSV with three string labels and four columns of numerical data. Generating data with the distributions necessary for a range of tasks is slightly more difficult. Perhaps the hardest part might be coming up with a story about those fake values that is interesting, immediately understood, and not diminished by the fact that the data is fake.</p><h3>How Might We Share Alternative Example Datasets with Others?</h3><h4>Submit a Pull Request to Add Your Example Dataset to the Example Datasets Used by a Heavily Used Code Package</h4><p><a href="https://vega.github.io/vega/">Vega</a>/<a href="https://altair-viz.github.io/">Altair</a> are related code libraries that have said on Twitter that they accept example dataset pull requests from members of the public.</p><h4>Add Your Replacement Example Dataset to Popular Lists of Example Datasets</h4><p><a href="https://www.kaggle.com/">Kaggle</a> is an example of this. There are also Awesome Lists of example datasets, such as <a href="https://github.com/awesomedata/awesome-public-datasets">here</a>, <a href="https://github.com/jdorfman/awesome-json-datasets">here</a>, and <a href="https://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html">here</a>.</p><h4>Provide Easy to Use Example Datasets on the Front Page of a Data Catalog</h4><p>Large data catalogs often have the problem that most members of the general public want simple, easy, and small datasets to work with for some side project, a hackathon, or a homework assignment. 
Unfortunately, their search results are swamped by many large, complex datasets meant for specialists that make up the bulk of the datasets in the catalog.</p><p>Presenting a few example datasets from within a larger data catalog and pointing out which well-known example datasets they’re similar to might be a way to get less experienced end users to take advantage of the large data catalogs.</p><h3>Related Things that Don’t Exist Yet But Could…</h3><p>There are a variety of things that could be built that relate to the ideas discussed in this post.</p><p><strong><em>Themed Collections of Example Datasets</em></strong></p><p><em>For an organization that wants to attract users to their data catalog, it might be possible to find a handful of datasets that could serve as drop-in replacement example datasets for the standards. These could be included in tutorials and code packages and offer a way for end users to discover their data catalog. For example, what if NASA had example datasets that served as drop-in replacements for the standard example datasets: Iris, Boston Housing, Wine Scores, and Titanic?</em></p><p><strong><em>Tooling to Help Find Similar Datasets</em></strong></p><p><em>What if there was a data profiling tool that scanned CSVs and JSONs and identified datasets most similar to a user-provided dataset? What if it was a CKAN add-on (CKAN is a common software for running open-data portals) that could be easily added to various open-data catalogs that already exist?</em></p><p><strong><em>Let Users Pick From Multiple Similar Example Datasets</em></strong></p><p><em>What if code packages had example dataset tooling that let people pick from not just 3–5 datasets but 5 example dataset types and 3 examples of each type? 
Slightly different data distributions might enable different aspects of the package to be better understood.</em></p><p><strong><em>Tooling to Create Fake Example Datasets that Mimic Well-known Ones</em></strong></p><p><em>What if there was a simple web-app that would help people create fake datasets with interesting stories whose data distributions fit with tutorial goals?</em></p><blockquote><strong>Maybe these are things you could create?</strong></blockquote>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stratigraphic top prediction in well logs via machine-learning: Predictatops]]></title>
            <link>https://justingosses.medium.com/https-medium-com-justingosses-stratigraphic-pick-prediction-via-supervised-machine-learning-predictatops-841cb5fc3efb?source=rss-64df3cb11ba4------2</link>
            <guid isPermaLink="false">https://medium.com/p/841cb5fc3efb</guid>
            <category><![CDATA[xgboost]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[stratigraphy]]></category>
            <category><![CDATA[geology]]></category>
            <category><![CDATA[sideprojectlove]]></category>
            <dc:creator><![CDATA[Justin Gosses]]></dc:creator>
            <pubDate>Sun, 11 Aug 2019 18:45:53 GMT</pubDate>
            <atom:updated>2019-08-12T00:35:49.364Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/567/1*tkWhxErxnQcXr5HNSKeubw.png" /><figcaption><a href="https://github.com/JustinGOSSES/predictatops">https://github.com/JustinGOSSES/predictatops</a></figcaption></figure><p>Back in May, I presented <a href="https://github.com/JustinGOSSES/predictatops/blob/master/docs/ACE2019_Gosses_theme8_StratigraphicTopML_201905018_submitted.pdf">a talk</a> at the 2019 AAPG ACE (American Association of Petroleum Geologists Annual Convention and Exhibition) on using machine-learning to predict stratigraphic surfaces in well logs. I described a Python package I have been working on as a side project called <a href="https://github.com/JustinGOSSES/predictatops">Predictatops</a>. Citation: Gosses, J.C., 2019, JustinGOSSES/predictatops: v0.0.3: Zenodo, <a href="https://doi.org/10.5281/zenodo.3247092">doi:10.5281/zenodo.3247092</a>. Given I’ve mentioned working on the issue of stratigraphic top prediction using machine-learning in <a href="http://justingosses.com/thoughts-on-machine-learning-predictions-of-chronostratigraphic-surfaces/">previous blog posts</a> on my personal website, I thought it wise to announce Predictatops here on my website as well.</p><blockquote><em>“which has obviously also been copied to Medium, as that’s where you’re reading it now”</em></blockquote><p>First, some definitions for the machine-learning folks who stumbled into a geologist’s blog post and the geologists who stumbled into a machine-learning blog post…</p><p><strong>Wells:</strong></p><p>Holes drilled in the ground.</p><p><strong>Well Logs:</strong></p><p>Well logs are created when geophysical measurements are made along a well. 
These are usually 1D measurements in the sense that a measurement is made at the bottom, then the tool is pulled up a little bit and then another measurement is made.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/223/0*CPiqZvgkRp6dr1ci.png" /><figcaption>Well logs from a single well to the left in this image. Interpretation of lithology of the rocks to the right in the colored bars. *Image from: Hall, B. (2016) Facies classification using machine learning, The Leading Edge, 35 (10): 906–909.</figcaption></figure><p>This continues until the entirety of the well is measured. Often, many different types of tools are used that measure different properties, resulting in multiple well logs for a single well. They measure things like how fast sound travels between two parts of the well or how much signal is bounced back and measured by a tool after a certain type of radiation is given off by a tool. These measurements are then turned into rock properties like density, grain size, etc. Further explanations of well log measurements are <a href="https://www.rigzone.com/training/insight.asp?insight_id=298&amp;c_id=">here</a> and <a href="https://www.spec2000.net/05-logaliastable.htm">here</a>. The picture above shows well logs from a single well.</p><p><strong>Tops:</strong></p><p>Tops are markers for the tops of things. Specifically, tops of geologic units. Everything below a top is one geologic layer and everything above a top is another geologic layer.</p><p><strong>Different Types of Tops:</strong></p><p>Tops can divide different categories of layers. Sometimes the layers are based on characteristics of the rock. For example, a top can separate rock made from small grains from rock made of large grains. Tops based on the physical make-up of rocks are often called lithologic tops. Lithology is a term for describing what a rock is made up of. 
Facies is another term that refers to categories of rocks in a well with similar characteristics.</p><p>Alternatively, tops can be boundaries based on more actionable characteristics like the ability of fluids to flow. A top might separate a geologic layer where you think the rock will allow fast flow of fluids like water, oil, or gas from a layer below the top where fluids will flow very slowly.</p><p>Geologists’ favorite way to place a top, however, is based on time. A top can separate rock deposited at one point in time from rock deposited at the next point in time. This is a stratigraphic or chronostratigraphic top! It is also called a time surface.</p><p>You might have noticed that the lithologic, flow-based, and stratigraphic tops are increasingly abstract.</p><p>The first, lithologic tops, are based on what the rock is made up of, which can be measured, at least indirectly. Flow is a little harder to predict from well logs, but the ability of fluids to flow is fundamentally also based on physical characteristics, which can be measured, just with more difficulty, as very small characteristics at the level of pores in rocks are what matter.</p><p>Stratigraphic tops are based on the age of the rocks. There is no measurement that relates to time that can be done routinely in well logs. You can collect fossils and interpret time based on the fossils found in different units (biostratigraphy). You can find volcanic ash and date it by looking at how much one type of element has turned into a different type of element due to decay (geochronology). However, neither of these can be done on all, or even most, wells. They’re too expensive and time consuming. Not all depth points will have fossils or ash layers for dating.</p><p><strong>Stratigraphic Correlation:</strong></p><p><em>How then does one go from well logs to stratigraphic tops representing time surfaces? 
That requires a model, and a head to put it in.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/0*kqJ4sko77uNT6Bqh.png" /><figcaption>The top picture shows lithostratigraphic correlation where the assumption is that rocks with similar characteristics are deposited at the same time. The bottom is a chronostratigraphic interpretation where models from outcrop studies of deltas are used as a guide. In the deltaic depositional environment, the coarser grains are deposited first and then finer grains, which build up over time as sheets angled towards the deeper water. The lower interpretation is correct. <a href="https://www.uh.edu/nsm/_docs/geos/faculty-files/pdf/Gani_Bhattacharya.pdf">* Figure from: Gani and Bhattacharya </a>(2005) Lithostratigraphy Versus Chronostratigraphy In Facies Correlations of Quaternary Deltas: Application of Bedding Correlation, River Deltas — Concepts, Models, and Examples SEPM Special Publication №83, SEPM (Society for Sedimentary Geology), ISBN 1–56576–113–8, p. 31–48</figcaption></figure><p>In practice, chronostratigraphic (mapping out time surfaces) well log correlation (correlation means interpreting where a top in well A exists in well B) is a combination of lithostratigraphic correlation (looking at 2 wells and matching curves of the well logs using the assumption that similar-looking curves have similar properties and are the same layers) and application of conceptual models. These conceptual models cover how sediment is transported and deposited. They are very helpful for stratigraphic correlations as they predict the spatial distribution of different types of rocks deposited at the same time and how those spatial associations can change over time. These conceptual models come from two places: outcrop studies (going out in the field and looking at rocks) and modern analogue studies (going out to the valleys, rivers, lakes, oceans, etc. and seeing how sediment gets deposited). 
The geology name for these conceptual models is depositional environments. Some additional information on them can be found <a href="https://wiki.aapg.org/Depositional_environments">here</a> and <a href="https://en.wikipedia.org/wiki/Depositional_environment">here</a>. Another key conceptual model is <a href="http://www.sepmstrata.org/page.aspx?&amp;pageid=32&amp;3">sequence stratigraphy</a>.</p><p>You can read more about stratigraphic well log correlation from <a href="http://www.sepmstrata.org/page.aspx?pageid=61">this page by SEPM</a> (a national sedimentology association).</p><p><em>Just to be clear, with Predictatops, we’re wanting to do chronostratigraphic correlation, not lithostratigraphic correlation.</em></p><p><strong>Supervised Machine-learning:</strong></p><p>In our context, supervised machine-learning means that, instead of letting the computer go on fun field trips like geologists to gradually build up mental conceptual models to use for correlating well logs chronostratigraphically, we’ll give the computer a dataset of already human-picked tops for one time surface and ask the computer to figure out a model that lets it mimic the geologist.</p><p>For more detailed explanations of machine-learning, there are lots of things on the web that Google will provide for you. I’m a fan of the <a href="https://towardsdatascience.com/explaining-supervised-learning-to-a-kid-c2236f423e0f?source=---------62------------------">Medium articles</a> by Cassie Kozyrkov, whose title is “Chief Decision Intelligence Engineer, Google” and who does a good job of packaging key points in a dense but fun-to-read manner.</p><p><strong>Building Geologic Observations into Features:</strong></p><p>Features in a machine-learning context are new data characteristics built from the original data. A basic example might be the sum of three other original data characteristics. Feature creation is a very common part of machine-learning. 
Rarely would you only use original raw data.</p><p><em>Unlike some of the demo datasets traditionally used in machine-learning demos where each row of the dataset is an independent entity and features are only created within each row, a key aspect of building features for stratigraphy applications is that a lot of valuable information can be gleaned if one creates features based on comparisons or aggregate observations from multiple depth points or even across wells.</em></p><p>One type of comparison is between each depth point in question and the depth points above, below, and around it within different length windows. Another type of comparison is between the characteristics of the well that holds the depth point being predicted for and the neighboring wells.</p><p>These comparison-based features are similar to what a geologist does visually when they put wells in a cross-section, or a sequence of wells’ well logs, and attempt to pick where stratigraphic tops should be correlated as shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WBa-9CJQROz0_Km7.png" /><figcaption>A cross-section showing different tops in different wells. Vertical curvy lines are gamma-ray well log curves. Alberta Geological Survey Open File Report 1994–14. Cross-section B to B`. * Image from: Hein, F. J., and Dolby, G., 2001, Regional lithostratigraphy, biostratigraphy and facies models, Athabasca oil sands deposit, northeast Alberta: Ann. Conv. Proc. Rock the Foundation (Calgary), Can. Soc. Petroleum Geologists, 3 p.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/241/0*uT487whE3_Q7OY9r.png" /><figcaption>Map of Mannville wells in the Alberta Geological Survey open file report 1994–14 that are used in the project I’m doing at the github link. * Image from: Hein, F. J., and Dolby, G., 2001, Regional lithostratigraphy, biostratigraphy and facies models, Athabasca oil sands deposit, northeast Alberta: Ann. Conv. Proc. 
Rock the Foundation (Calgary), Can. Soc. Petroleum Geologists, 3 p.</figcaption></figure><p>Only creating features based on data in individual depth points in individual wells would be like cutting up all your well logs from all your wells into little bits, giving them to your geologist in a bucket, shaking the bucket, and then asking the geologist which depth point pieces are at the same depth as the top you’re trying to predict. Obviously, that wouldn’t work out so well. If you were to set up a stratigraphic machine-learning project and only create features from data at the same depth points for each well, you’d be doing the same thing to your machine-learning model.</p><h3>Predictatops: Supervised Machine-Learning of Stratigraphic Surfaces</h3><p><strong>Predictatops’ Supervised Machine-learning Goal:</strong></p><p>The goal of Predictatops is to be able to give it a training dataset of wells (that include both well logs and trusted human-assigned tops) for a single time surface and a test dataset of just well logs from many wells, and get back where the machine-learning model thinks the top should be in each well in the test dataset.</p><p><strong>How to judge success:</strong></p><p>We want to minimize the distance between where the ML model predicted the top should be in each well and where the human(s) put the top. That difference will be our error. For comparing different runs, we’ll return the RMSE (root mean squared error) for the top we’re trying to predict across all the wells.</p><p>We could have picked a different way to judge prediction success, so I’ll explain a bit about why I think this is a better way.</p><p>You might be tempted to set the problem up as a classification problem where we don’t predict a single pick but rather, for each depth point in every well, predict what formation (another geology word for layer) that depth point is. 
The problem with doing it this way is that geologists really don’t care whether you got the formation correct far away from any top. That might not even be a hard problem. Additionally, how good your statistics appear to be will be affected by how thick the layers are, which isn’t ideal.</p><p>You could also frame the problem in terms of a binary question of whether or not the top was predicted to be in the exact same place as the geologist put it. The problem with this approach is two-fold. First, you immediately have a huge class imbalance problem on your hands. If your well logs have a different measurement every 1/3 of a meter and typically are 300 meters long, every well will have 1 instance of data to represent the top and 899 instances to represent not the top. This will make machine-learning very difficult. Second, a binary prediction approach will affect the way you judge accuracy, resulting in some not particularly useful information. You might get 4% of the depths predicted exactly right and 96% wrong. However, if all your 96% wrong tops are within plus or minus 2 meters of the actual tops, that’s amazingly good. If the average distance between the actual top and predicted top is plus or minus 56 meters, that’s not good. In both cases, you were only 4% accurate.</p><p><strong>Project setup:</strong></p><p>If we can’t set up the problem as a binary machine-learning problem or a classification machine-learning problem, how do we set up the problem?</p><p>In Predictatops, we handled this by making it a two-step prediction with the first step being classification, not on formation labels, but on distance zones away from the pick we were trying to predict.</p><p>Imaginary zones were created at the pick, slightly away from the pick below, slightly away from the pick above, farther from the pick below, farther from the pick above, and everything else. We then ran classification to predict the zone of each depth point in each well. 
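</p><p>To make the zone idea concrete, here is a minimal sketch of how the depth points in one well could be labeled by their distance from a human-picked top. This is not the code inside Predictatops: the function name <code>label_zones</code>, the zone numbering, and the zone widths (0.25, 2, and 10 meters) are all assumptions picked only for illustration.</p><pre>
```python
import numpy as np


def label_zones(depths, pick_depth, at_pick=0.25, near=2.0, far=10.0):
    """Label each depth point in one well by its distance zone from the pick.

    Zone numbers (hypothetical, for illustration only):
    0 = everything else, 1 = farther above, 2 = slightly above,
    3 = at the pick, 4 = slightly below, 5 = farther below.
    """
    # Signed distance in meters; negative means shallower than the pick.
    offset = np.asarray(depths, dtype=float) - pick_depth
    # Zone boundaries, ordered from far above the pick to far below it.
    edges = [-far, -near, -at_pick, at_pick, near, far]
    # digitize returns 0 for points more than `far` meters above the pick
    # and 6 for points `far` meters or more below it; both of those belong
    # to the "everything else" zone.
    zone = np.digitize(offset, edges)
    return np.where(zone == 6, 0, zone)


# Made-up depths (in meters) for a single well with a pick at 20 m.
depths = [5, 10, 15, 19, 20, 21, 25, 35]
print(label_zones(depths, pick_depth=20.0))
```
</pre><p>Labels like these would become the target column for the classification step, giving the model many more "near the top" examples to learn from than a single at-the-top row per well.</p><p>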
Those results were then run through an additional process that produced higher numbers for depth points that had the most predictions for being at the top or very near to the top around it. More of this process is described in <a href="https://github.com/JustinGOSSES/predictatops/blob/master/docs/ACE2019_Gosses_theme8_StratigraphicTopML_201905018_submitted.pdf">this presentation given at AAPG ACE</a> and in the <a href="https://justingosses.github.io/predictatops/html/index.html">documentation of Predictatops</a>.</p><p><strong>Data Requirements:</strong></p><p>The supervised part of supervised machine-learning means you have to have human-labeled tops to start. In addition, you have to have enough wells from enough places that the full complexity of the stratigraphy is captured in the training wells. The exact number required is hard to define as it depends on the complexity of the problem, how generalizable we want the solution to be, and the algorithms used. If you pressed me, I’d say you’ll need several hundred tops, if not more than a thousand, for your training datasets.</p><p>While we’re discussing training dataset sizes, a point of interest here is that the number of required training wells goes down if someone is kind enough to create and release as open source a model pre-trained on, say, 200,000+ wells in a depositional environment similar to yours. Then you might be able to use that model as a starting point and retrain it with your use-case-specific training dataset. This is an over-simplistic description of transfer-learning, which has resulted in incredible gains in image and natural language processing machine-learning.</p><p>An additional data requirement is that these wells have the same, or at least very similar, well log types. Machine-learning likes all the inputs to be collected and prepped the same way. 
Although this may seem like a simple ask to those not in oil &amp; gas, the reality is that many wells will be drilled with different well logs. Some types will be always present. Some types will be mostly present. Others rarely present. Some types, like sonic, will be very common, but different tool vendors use tools that measure the same property in different ways. Well log normalization by petrophysicists may be required.</p><p>In the demo dataset from the McMurray formation, a collection of well logs that were present in the highest number of wells was found. The impact of that constraint was that instead of 2000 wells, just over 1200 wells were used. There are classes in Predictatops to help identify what well logs are the most common in a given dataset.</p><p><strong>Predictatops Project Status:</strong></p><p>Predictatops is functional but still changing. I tried to organize things such that large pieces could be skipped and others added without too much trouble. It has a demo dataset from the McMurray in Alberta, Canada and instructions for how to run it with that demo dataset. The documentation is still a work in progress but the basics are all there. I’ve made some changes to make it more generalizable, but there is more to be done in that area.</p><p><strong>Predictatops Performance:</strong></p><p>The top McMurray currently has an RMSE (root mean squared error) of 6.6 meters. More information is available in <a href="https://github.com/JustinGOSSES/predictatops/blob/master/docs/ACE2019_Gosses_theme8_StratigraphicTopML_201905018_submitted.pdf">this presentation given at AAPG ACE</a> and in the <a href="https://justingosses.github.io/predictatops/html/index.html">documentation of Predictatops</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1012/0*NPpuaugzZYcT6ytu.png" /><figcaption>A histogram showing the distribution of McMurray Top prediction error. Bars count how many wells were in each group. 
Groups are based on difference in depth between actual human-picked top and machine-picked top. RMSE is 6.6 m.</figcaption></figure><h4>Wait, but I don’t trust my geologist.</h4><p>Yes, one limitation of all of this is that fundamentally the machine-learning model is trying to mimic a person’s (or multiple people’s) top interpretations. The best you’ll get out of the machine-learning model in terms of performance accuracy is slightly worse than the training dataset supplied by human geologist(s).</p><p>This is also a reason you might want to only use training data consisting of tops interpreted by a single geologist and not two geologists with different models in their heads.</p><h4>When would you want to use this type of approach?</h4><ol><li>When you don’t have enough time for a geologist to interpret all the wells, but you have enough time to interpret 1000 out of 7000 wells.</li><li>When you have two areas worked by different geologists and you want to see how the interpretation of geologist A transfers to the neighboring area worked by geologist B.</li><li>When you want to identify the 5% of the wells with the most uncertainty and have your best geologists focus on those.</li></ol><h4>Are there uses for this type of approach even if I already have geologist-generated tops on all my wells?</h4><p>Yes, the great thing about this type of approach is it scores each depth point in every well. That score can be normalized to a probability of how likely each depth point is to actually be the top. There are wells where only one depth point or a small cluster of depth points have higher scores. Other wells have high scores spread more widely throughout the well. Additionally, some wells have predicted tops at similar depths to their neighbors. Other wells do not. 
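</p><p>As a rough sketch of the normalization step mentioned above (I haven’t built this into Predictatops, so the function name <code>scores_to_probabilities</code> and the made-up scores are purely hypothetical), each well’s scores can be divided by their sum so they add up to 1. The height of the largest value then gives one simple way to compare how concentrated or spread out the scores are from well to well.</p><pre>
```python
import numpy as np


def scores_to_probabilities(scores):
    """Scale one well's non-negative depth-point scores so they sum to 1."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    if total == 0:
        # No depth point received any score, so treat every depth as equally likely.
        return np.full(scores.shape, 1.0 / scores.size)
    return scores / total


# One well with a sharp peak versus one with scores spread across many depths.
confident_well = scores_to_probabilities([0, 0, 8, 1, 1, 0])
uncertain_well = scores_to_probabilities([2, 1, 2, 2, 1, 2])
print(confident_well.max(), uncertain_well.max())
```
</pre><p>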
<em>This information can be used to generate uncertainty scores/maps/curves and help geologists know where to focus their interpretation efforts.</em> I haven’t done this with Predictatops yet, but it is totally possible!</p><h3>Making it easy to use the demo dataset:</h3><p>As mentioned before, the <a href="https://ags.aer.ca/publications/SPE_006.html">demo dataset</a> comes from the McMurray formation in Alberta, Canada. There’s information on it in the <a href="https://github.com/JustinGOSSES/predictatops">README.md</a> and <a href="https://justingosses.github.io/predictatops/html/index.html">documentation of Predictatops</a>. The full reference is: Wynne et al., (1994) Athabasca Oil Sands Database McMurray/Wabiskaw Deposit, Open-File-Report 1994–14, Alberta, Canada; Alberta Geological Survey. Links to <a href="https://ags.aer.ca/document/OFR/OFR_1994_14.PDF">report</a> &amp; <a href="http://ags.aer.ca/publications/SPE_006.html">dataset</a>.</p><p>This is a great open-source dataset from the Alberta Energy Regulator and Alberta Geological Survey that could be used in a lot of other machine-learning-focused work, as it is one of the larger open-source datasets of a single formation already compiled and prepped by a single group.</p><p>Discovering, evaluating, loading, and transforming data takes a lot of time. There’s always a risk you’ll get through some or all of those steps and then discover the dataset won’t work for your purposes. This dataset is already put together and can be used for both stratigraphic top prediction and facies prediction.</p><p>I’ve been working on a pull request for the <a href="https://github.com/fatiando/rockhound">Rockhound</a> Python package, a project to make loading geologic demo datasets super quick and easy, that will let you pull in the fully prepped and merged McMurray dataset with two lines of code. 
Currently, <a href="https://github.com/JustinGOSSES/rockhound">there is this pull request</a> for the McMurray dataset prepped for facies prediction. Once that pull request is accepted, I’ll push another dataset prepped for stratigraphic prediction.</p><h3>Other Machine-Learning Approaches to Stratigraphy</h3><p><strong>Lithostratigraphy:</strong></p><p>Lithostratigraphy is basically curve matching, so computational approaches go back to the 1970s. Some of the better results seem to be geology-specific variations on dynamic time warping. One example is <a href="http://www.kgs.ku.edu/Publications/OFR/2002/OFR02_51/ManualOFR2002-51.pdf">CORRELATOR</a> from Kansas Geological Survey, but there are more modern approaches in Python as well.</p><p>The reasons why this type of computational approach seems never to have caught on are likely complex. I don’t claim to understand all of them, but I suspect it was related to the fact that they often didn’t return a single surface but rather an arbitrary number of surfaces. This meant the geologist still had to look at the well logs in order to use the results. Hence, the time savings aren’t there.</p><p><strong>Chronostratigraphy:</strong></p><p>If you’re interested in this type of thing, here are two other recent examples of applying machine-learning to stratigraphy that I think are very interesting.</p><ol><li>Alex Bayeh, Michael Ashby, Darrin Burton, and Seth Brazell at Anadarko (now Occidental) published “<a href="https://www.onepetro.org/journal-paper/SPWLA-2019-v60n4a1?sort=&amp;start=0&amp;q=brazell&amp;from_year=&amp;peer_reviewed=&amp;published_between=&amp;fromSearchResults=true&amp;to_year=&amp;rows=25">A Machine-Learning-Based Approach to Assistive Well-Log Correlation</a>”, which uses thousands of pairs of well logs that are and are not representing the same layer to train a model, which is then used with a small number of tops from a specific formation to predict tops for that formation. 
I was excited to see this type of approach because (1) I thought an approach sort of like this might be possible but don’t have access to a large enough open-source dataset to actually attempt it myself, and (2) it demonstrates a different approach to supervised machine-learning applied to stratigraphy.</li><li>Additionally, there’s been a variety of papers trying to apply wavelet transform theory to well log correlation for the past two decades. My opinion of these approaches has typically been that there is a lot of complexity without that much to show for it in terms of useful predictions. A recent exception to this was <a href="https://www.onepetro.org/conference-paper/SPE-183860-MS">Ye et al.</a>, 2017’s “<em>Rapid and Consistent Identification of Stratigraphic Boundaries and Stacking Patterns in Well Logs — An Automated Process Utilizing Wavelet Transforms and Beta Distributions</em>”, which looks like it would be an excellent feature creation step to use in supervised, or even unsupervised, machine-learning applied to stratigraphy.</li></ol><h3>Contributing to Predictatops</h3><p>Pull requests and issues (in the form of bugs, enhancements, comments, and even idle observations) are very welcome on Predictatops. I currently have 18+ issues on the repository for things to do. It is a side project, so please don’t expect it to be perfect, but I’m interested in hearing feedback and other people’s approaches to this type of problem.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Geoscience to Data Science Starter Pack]]></title>
            <link>https://justingosses.medium.com/geoscience-to-data-science-starter-pack-5828ccc59be1?source=rss-64df3cb11ba4------2</link>
            <guid isPermaLink="false">https://medium.com/p/5828ccc59be1</guid>
            <category><![CDATA[career-change]]></category>
            <category><![CDATA[geoscience]]></category>
            <category><![CDATA[learning]]></category>
            <category><![CDATA[geology]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Justin Gosses]]></dc:creator>
            <pubDate>Fri, 25 Jan 2019 14:18:54 GMT</pubDate>
            <atom:updated>2019-02-03T19:09:32.124Z</atom:updated>
            <content:encoded><![CDATA[<h3>Why Write this and Who is it targeting?</h3><p>I wrote a blog post, <a href="http://justingosses.com/learning-to-code/">LEARNING TO CODE</a>, on my website in early 2016, three years ago. That blog post summarized the different styles of learning you could pick from when trying to learn how to code. This blog post, like that one, was prompted by the realization that I had the same conversation with two different people within a single week. They were asking the same questions, so I might as well write everything down.</p><blockquote><strong>This post is directed at Houston-based geoscience types starting off on a months-to-years-long process of improving their skills in data science and maybe eventually getting a job in data science. It lays out the things I’ve found myself telling people in real life.</strong></blockquote><h3>Step 1: Figure out if you’re interested in this type of thing….</h3><p>I’ve not seen a lot of writing on the best way to do this. The best path forward may be a personal decision to a large degree.</p><p>If you have kids and want to involve them in your first steps, <a href="https://code.org/">code.org</a> and <a href="https://scratch.mit.edu/">Scratch</a> are two resources to try out if you haven’t written any code. Both are designed for kids but still kinda cool.
They’ll let you see what kind of logic writing code uses, often in a pictorial form that doesn’t require memorizing any syntax.</p><p>You might also want to try some shorter lessons, 1–5 hours long, on sites like <a href="https://www.codecademy.com/">Codecademy</a> or take your time going through an introduction to any of the languages on <a href="https://www.w3schools.com/">w3schools</a>.</p><p>If you’re more motivated by what you can eventually do, you might try watching a few videos of talks from any of the <a href="https://www.youtube.com/playlist?list=PLYx7XA2nY5GfdAFycPLBdUDOUtdQIVoMf">SciPy</a> conferences or the machine-learning videos from <a href="https://www.youtube.com/channel/UCsX05-2sVSH7Nx3zuk3NYuQ">PyCon</a>. They’ll be partially over your head, but they can still be very interesting. You can also take a look at the blog posts summarizing what projects were made during <a href="https://agilescientific.com/blog/2018/12/17/the-london-hackathon">geology hackathons by AgileScientific</a>.</p><h3>What Language to Learn?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*PPIp7twJJUknfohZqtL8pQ.png" /></figure><h4>Python</h4><p>The favorite. Different computer languages are better for different tasks. They also change in popularity over time. There used to be Python vs. R debates for data science, but those have faded recently as Python has won over more people. Two libraries you’ll use often that also have good documentation &amp; lots of video tutorials are <a href="https://scipy.org/scipylib/">SciPy</a> and <a href="https://scikit-learn.org/stable/">Scikit-learn</a>. If you want to try NLP (natural language processing), <a href="https://spacy.io/">spaCy</a> has maybe the best documentation of the major Python machine-learning libraries.</p><h4>R</h4><p>While Python tends to dominate the hard sciences and to a decent extent machine-learning, R still leads among the social sciences.
There is interesting geoscience computing done in R, but most is done in Python.</p><h4>Other languages that don’t start with Pytho</h4><p>Python is a very intuitive computer language as far as these things go, so jumping to another language can be a somewhat painful experience, at least initially. If you start in Python and are starting to grasp the language, I’d encourage you <em>not to stay with only Python</em>. One, it limits what you can do. Although capable of a lot, Python isn’t good at everything. Two, you’ll become better at programming once you can hop between languages. There will be people, sometimes people with an incentive, who might say things at a SciPy conference like, “I am only an astrophysics PhD, I can’t be expected to understand something difficult like JavaScript” or “If you learn JavaScript then you’ll be a web developer and not a scientist”. Those people are wrong. Ignore them.</p><p>JavaScript (and HTML, CSS)</p><p>Although you will probably start off with Python, picking up JavaScript as language number two would be my recommendation. The web runs on HTML, CSS, JavaScript, browsers, and pictures of cats. If you want to build anything with a decent human interface, visualize data in a slightly non-standard way, or reach people online, having some JavaScript knowledge is powerful.</p><p>An important point to make when you talk about JavaScript is that <a href="https://www.w3schools.com/js/default.asp">plain JavaScript</a>, or what is sometimes called <a href="https://snipcart.com/blog/learn-vanilla-javascript-before-using-js-frameworks">Vanilla JavaScript</a>, is perfectly fine most of the time. There are lots of JavaScript frameworks you could theoretically pick up, <a href="https://reactjs.org/">React</a>, <a href="https://vuejs.org/">Vue</a>, <a href="https://angularjs.org/">Angular</a>, etc., but I tend to have a <em>“use it only if you have a well demonstrated need”</em> perspective on JavaScript frameworks.
If you end up doing a large, complicated <a href="https://en.wikipedia.org/wiki/Front_and_back_ends">front-end project</a> with lots of state management, that’s when you should consider a JavaScript framework.</p><p>C++</p><p>C++ and Java are the languages most often learned by Computer Science majors in university. There are good reasons for this and not quite so good reasons for this. Certain things, like highly dependable applications, embedded applications, and low-level high-performance computing, are done in C++. If you are a geophysicist and did some C++ in school, it might be a place to continue. If not, it probably isn’t the place to start.</p><p>Java</p><p>Some of the things that could be said about C++ could also be said about Java. There is a fair amount of machine-learning done using Java when it is done via <a href="https://en.wikipedia.org/wiki/Distributed_computing">distributed computing</a> on big data. <a href="https://towardsdatascience.com/deep-learning-with-apache-spark-part-1-6d397c16abd">Spark</a> is an important tool in that space to at least know about. If you’re interested in Spark but want to stick to Python, there is also <a href="https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873">PySpark</a>.</p><p>For a more detailed and humorous explanation, there is always <a href="http://carlcheo.com/wp-content/uploads/2014/12/which-programming-language-should-i-learn-first-pdf.pdf">this infographic</a> representing computer languages as Lord of the Rings characters, which never goes out of style assuming you’ve already seen the movie.</p><h3>How to learn?</h3><h4>In-person Bootcamps, Online Courses, Online Lessons, etc.</h4><p>As mentioned above, I previously wrote a blog post in 2016 about the <a href="http://justingosses.com/learning-to-code/">different types of ways to learn how to code</a>. We all have different learning style preferences.
We also have different life constraints that affect what methods fit into our lifestyle. Much of what I wrote then ties in closely with this post but from a more generic learning-to-code, less data-science-specific perspective.</p><h4>Useful Things that Didn’t Exist (I think) in January 2016</h4><p>One thing of importance to note is that <a href="https://notebooks.azure.com/">Microsoft Azure notebook</a>s and <a href="https://colab.research.google.com/notebooks/welcome.ipynb#recent=true">Google Colab</a> didn’t exist in January 2016. If they had, and I had known about them, I would have included them in the previous blog post. These are similar to a <a href="https://jupyter.org/">Jupyter Notebook</a> but run in the cloud and are accessed via your browser. They will let you get started writing Python without having to deal (at least initially) with the often messy process of installing languages, editors, and code libraries locally on your computer. If you do install things on your local computer, the <a href="https://www.anaconda.com/">Anaconda</a> installation method is probably the easiest path forward.</p><h3>Build Things People Can Find</h3><h4>Start a GitHub Profile</h4><p>Why? = <strong>Because if you’re self-taught you need to show evidence you can create things and write actual code.</strong> The commonly accepted way to do this is to give people a link to your GitHub profile where you have a bunch of public code projects. These can be data visualizations, machine-learning baby-scale projects, whatever. Make sure not all of them are forks or class work where you followed instructions.</p><p>If you’re not familiar with the terms, <a href="https://www.coredna.com/blogs/what-is-git-and-github-part-two">here</a> are some definitions of <a href="https://en.wikipedia.org/wiki/Git">Git</a> and <a href="https://en.wikipedia.org/wiki/GitHub">GitHub</a>.
There are services other than GitHub you can use, like GitLab or Bitbucket, but GitHub is the most common.</p><p>While on the topic of GitHub, I will note that <a href="https://github.com/softwareunderground/awesome-open-geoscience">this repository of <em>“AWESOME OPEN GEOSCIENCE”</em></a> code projects is something to check out. It lives on GitHub. It contains a wide variety of lesser-known geoscience-domain-specific tools you can use. It started as a conversation I had with others in the <a href="https://softwareunderground.org/slack/">Software Underground Slack channel</a>. It is one of the many “Awesome lists” out there for code in a specific domain or application area.</p><h4>Personal website</h4><p>Why? = Because it is good web programming practice and shows you can build something. Additionally, it can be a way to do personal branding.</p><p>The two easiest ways to do this are a <a href="https://wordpress.com/">WordPress</a> website or a <a href="https://pages.github.com/">GitHub Pages</a> website. WordPress is a content management system, or CMS. You technically don’t need to code at all if you use WordPress, though you can do some small edits in HTML, CSS, and JavaScript if you’d like to. On the back-end side, WordPress runs PHP. WordPress may cost you depending on where it is hosted and whether you want a more professional web address. GitHub Pages is free, but only front-end (<em>HTML, CSS, JS</em>), meaning no connection to a database or back-end scripts (<em>Python or PHP</em>). There are plenty of open-source, free static page templates you can use to get started with a <a href="https://pages.github.com/">github.io page</a>.</p><ul><li><a href="http://justingosses.com/">http://justingosses.com/</a> <strong>This is my personal website</strong>. I should probably update it, but I’m sharing it now just so you don’t only look at flawless ones and get discouraged.
It is mostly WordPress.</li><li><a href="http://kbroman.org/simple_site/pages/user_site.html">http://kbroman.org/simple_site/pages/user_site.html</a> : This github.io page is a nice template that you can use, substituting in your own content.</li><li><a href="https://medium.com/@svinkle/publish-and-share-your-own-website-for-free-with-github-2eff049a1cb5">https://medium.com/@svinkle/publish-and-share-your-own-website-for-free-with-github-2eff049a1cb5</a>: Another tutorial.</li></ul><h3>Active In-person Learning</h3><h4>Tutorials at Tech Conferences</h4><p>Why? = Because they’re really good at getting as much of the information coming out of the firehose as possible to go directly into your brain. They can also serve as starter material for a project on your GitHub. Often the tutorials will be based around a library or a type of task. You’ll usually leave with a link to not just slides but also all the code the instructor ran, which sets you up to learn it more deeply later on. Conferences can be a good way to network too.</p><ul><li><a href="https://scipy2018.scipy.org/ehome/299527/648139/">SciPy Conference</a></li><li><a href="https://us.pycon.org/2018/schedule/tutorials/">PyCon</a></li></ul><h4>Hackathons</h4><p>Why? = Because hackathons are the fastest way to build things that demonstrate your ability to combine concepts and techniques to solve a real world problem.
They’re also great for networking and learning new things through collaborative problem solving.</p><p>The factors that have differentiated good from less good hackathons in my limited experience were a length of at least 5 hours if not 2 days, interesting project ideas, project ideas scaled to the time and skillsets of participants, most participants knowing how to code at least a little, and enough coffee/food that you don’t have to leave.</p><p>Good hackathons likely to be in Houston in the future:</p><ul><li><a href="https://events.agilescientific.com/">https://events.agilescientific.com/</a> : Agile runs several a year, typically around conferences, and I can attest they are quite good.</li><li><a href="http://houstonhackathon.com/">http://houstonhackathon.com/</a> : I’ve never been able to go myself as I’ve always been out of town, but I’m told it’s worth your time.</li></ul><h3>Single-Speaker-Style Meet-ups</h3><p>Why? = There’s a reason schools spend a lot of time filling people’s heads via the single-speaker-at-front-of-room format. It is generally effective.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7IIC7vCHZJlh7FLtgNj8_A.jpeg" /><figcaption>Speaker at a meet-up hosted at Station Houston, a start-up incubator. Meet-ups are usually free events hosted by an organization that wants to help the tech community.</figcaption></figure><p>There are a variety of Houston meet-ups in the machine-learning, data science, and Python space. These meet-ups are almost always free. They vary in quality. Sometimes when they’re not good, it can be because they’ve turned into a vendor pitch or the content was different than what was listed. The Houston Energy Data Science meet-up sometimes falls into the trap of speakers being just a bit too vendor-ish, but usually it is okay.
SPE (Society of Petroleum Engineers) sometimes has oil and gas data science “meet-ups”, but they aren’t free so I never go (<strong><em>Hint hint to anyone at SPE Gulf Coast Section</em></strong>).</p><ul><li><a href="https://www.meetup.com/Houston-Data-Science/">https://www.meetup.com/Houston-Data-Science/</a></li><li><a href="https://www.meetup.com/Houston-Energy-Data-Science-Meetup/">https://www.meetup.com/Houston-Energy-Data-Science-Meetup/</a></li></ul><h3>Non-Just-A-Speaker Meet-ups</h3><p>Why? = Because not all meet-ups are just a person talking and that’s a good thing. Some of them are more about doing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/954/1*cU-USMLKOVoWxigAQGzWkA.jpeg" /><figcaption>Presentation of data visualizations created during a data jam at Houston Data Visualization Meet-up</figcaption></figure><p><a href="https://www.meetup.com/sketchcity/">Sketch City</a> regularly has people, local government agencies, and non-profits come in to share a bit about their open data and what problems/solutions/visualizations/predictions a data-literate member of the public might make from their data. It is a good meet-up to attend for getting project ideas and networking within the local civic tech or civic-tech-interested crowd.</p><p><a href="http://www.meetup.com/Houston-Data-Visualization-Meetup/">The Houston Data Visualization Meet-up</a> (<em>disclaimer: I help co-lead this one</em>) has both single-speaker format and data-jam format meetings. Data-jams are often on Saturday morning and consist of 10–30 people working in small groups to visualize a dataset they were just given that morning. Often these datasets come from a local community group or the city of Houston, though we’ve also used non-local datasets like ChemCam data from the Mars rover Curiosity or a dataset of Russian bots’ posts on Twitter.
In addition to being great starter projects for your portfolio and good networking, this type of meet-up exposes you to a wide variety of GUI and code library data visualization toolsets. You’ll find out what tools are good for what use cases.</p><ul><li><a href="https://www.meetup.com/sketchcity/">https://www.meetup.com/sketchcity/</a></li><li><a href="http://www.meetup.com/Houston-Data-Visualization-Meetup/">http://www.meetup.com/Houston-Data-Visualization-Meetup/</a></li></ul><h3>Filling Your Head Digitally</h3><p>Once you reach a certain level of proficiency, learning will start to become more about keeping up and continuing to grow. The rate of “new” in data science greatly outstrips that in geology. It also occurs in different places. “New” in oil &amp; gas geology tends to mostly occur in yearly conferences, monthly or quarterly journal publications, new corporate best practice documents from on high, and major software updates. “New” in data science occurs in those places too, but it also occurs to a much larger extent on Slack, Twitter, podcasts, and Medium. New techniques, new results, and entirely new libraries are often announced via those channels before they are published in a journal or integrated into a GUI software application your organization might purchase. The flip side of using the methods below to ingest new data science content is that the deluge can sometimes get overwhelming.</p><h3>Slack Communities</h3><p>Why? = Because your niche interest area may not overlap with the people you interact with on a daily basis. Even if it does, the number of people is going to be small. Slack is a way to expand that community discussion digitally. Slack is an asynchronous communication platform built around channels, which each have a different topic. It is similar to older chat programs but the user design works a lot better. The <a href="https://softwareunderground.org/slack/">softwareunderground</a> Slack team is all about computing &amp; geoscience. Anyone can join.
A few example channels are geospatial, houston, js, kaggle, open-geoscience, python, r-users, reading, and viz.</p><ul><li><a href="https://softwareunderground.org/slack/">https://softwareunderground.org/slack/</a> : You can join at this link. Cheers to the <a href="https://agilescientific.com/">Agile Scientific</a> group for setting it up.</li></ul><h3>Twitter</h3><p>Why? = Because if Slack, podcasts, Medium, journals, etc. all have a frequency, Twitter vibrates the fastest. Often things will appear here first before they appear elsewhere. The girl who builds the crazy visualizations that inspire your next project? She’ll post drafts to Twitter. Someone who recently discovered a rarely used but super useful function for your domain in a general purpose Python library? They’ll post about that to Twitter. Twitter isn’t just data science, of course. You’ll have to curate your feed by following people with good content, and that takes time, but it is an option for ingesting content at the cutting edge.</p><ul><li><a href="https://twitter.com/JustinGosses">@<strong>JustinGosses</strong></a></li><li><a href="https://twitter.com/GeostatsGuy"><strong>@</strong>GeostatsGuy</a></li><li><a href="https://twitter.com/leouieda"><strong>@</strong>leouieda</a></li><li><a href="https://twitter.com/landlabtoolkit">@landlabtoolkit</a></li><li><a href="https://twitter.com/fatiandoaterra">@fatiandoaterra</a></li><li><a href="https://twitter.com/PaulHCleverley">@PaulHCleverley</a></li><li><a href="https://twitter.com/vaex_io">@vaex_io</a></li><li><a href="https://twitter.com/ncasenmare">@ncasenmare</a></li><li><a href="https://twitter.com/jordansread">@jordansread</a></li></ul><h3>Podcasts</h3><p>Why? = Because data science isn’t just in text form.</p><ul><li><a href="https://dataskeptic.com/">DataSkeptic</a> : Data Science Explanations &amp; Discussion</li><li><a href="https://undersampledrad.io/">under sampled radio</a> : Geology + Computing</li></ul><h3>Medium</h3><p>Why?
= Because getting a few things into your head via 5–30 minutes of reading is sometimes the exact right size of learning.</p><ul><li><a href="https://hackernoon.com/@kozyrkov?source=user_profile---------0---------------------">https://hackernoon.com/@kozyrkov</a> : <a href="https://hackernoon.com/@kozyrkov?source=user_profile---------0---------------------">Cassie Kozyrkov</a>, Chief Data Intelligence Engineer at Google. She does a great job condensing the subject matter into small, useful bits of explanation you can use with other people, without becoming fluffy like so many pieces in Forbes or Business Insider that cover similar ground.</li><li><a href="https://medium.com/multiple-views-visualization-research-explained">https://medium.com/multiple-views-visualization-research-explained</a> : Explains data visualization research just like the name says. Written by a collection of academic data visualization researchers.</li><li><a href="https://medium.com/vis-gl">https://medium.com/vis-gl</a> : Uber’s data visualization group does some great stuff and open-sources a lot of it. This is a place to learn about new tools that combine JavaScript, data visualization, and geospatial.</li></ul><h3>LinkedIn</h3><p>Why? = Well to be honest, I’m not sure I get that much from LinkedIn, but it is good for finding out about small conferences or meetings with a data science focus that intersect with geology or oil &amp; gas.
The first two groups listed below have mini-conferences or workshops that center on the intersection of oil &amp; gas and analytics.</p><ul><li><a href="https://www.linkedin.com/company/spe-gulf-coast-section/">https://www.linkedin.com/company/spe-gulf-coast-section/</a> : Society of Petroleum Engineers has an active data analytics section in Houston.</li><li><a href="https://www.linkedin.com/company/houston-geological-society/">https://www.linkedin.com/company/houston-geological-society/</a> : Houston Geological Society has an analytics mini-conference in Houston in 2019.</li><li><a href="https://www.linkedin.com/in/ricekenkennedy/detail/recent-activity/">https://www.linkedin.com/in/ricekenkennedy/detail/recent-activity/</a> : Rice University has workshops and talks that might be of interest. I’ve previously presented remotely at Rice Data Science day, which has had some geology &amp; machine-learning talks.</li></ul><h3>Anything I missed that you really find helpful? Write a comment below or message me, and I’ll add it.</h3><h3>Good Luck!</h3><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5828ccc59be1" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>