Data Querying - Medium

Calling a Transformer ML Model directly via SQL to predict sentiments

Romain Rigaux — Sat, 07 Jan 2023 06:29:40 GMT

Tutorial on applying a Hugging Face Machine Learning model directly to some table data via SparkSql UDFs and MLflow

MLflow and Apache Spark shine for manipulating your data. Let’s focus on showing how an already existing ML model, here the popular distilbert, can be made available in SQL.

Democratize your ML: SQL is much simpler to use than regular Python. What if the model was easily available to your SQL user base?

Applying prediction directly on columns in a table

Here is the high level architecture:

Registering the model into MLflow then Spark

First we pull the model from Hugging Face and register it in MLflow:

https://medium.com/media/5b525a3b07ebea995e0a581ea765cf58/href

Then via this Notebook we demo how to make the model available as a function that can directly be called in SQL queries.

https://medium.com/media/d411e36faa1593547c3c7434642dc65f/href
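As a rough sketch of what this step amounts to, assuming the model was logged to MLflow as a pyfunc model: `mlflow.pyfunc.spark_udf` wraps the model as a Spark UDF and `spark.udf.register` exposes it to SQL. The function name `predict_sentiment`, the string result type, and the table and column names are placeholders for illustration and are not taken from the demo repository.

```python
def register_model_as_sql_function(spark, model_uri, name="predict_sentiment"):
    """Expose an MLflow pyfunc model as a SQL function on a Spark session."""
    # Imported lazily so the query helper below works without MLflow installed.
    import mlflow.pyfunc

    # Wrap the logged model in a Spark UDF (assumed here to return a string label).
    model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type="string")
    # Make the UDF callable by name from Spark SQL.
    spark.udf.register(name, model_udf)
    return name


def sentiment_query(table, column, name="predict_sentiment"):
    """Build a SQL statement applying the registered function to a column."""
    return f"SELECT {column}, {name}({column}) AS sentiment FROM {table}"


# Hypothetical table and column names, for illustration only.
print(sentiment_query("reviews", "text"))
```

Once the function is registered, the generated statement can be passed to `spark.sql(...)` or typed by any SQL user on the same session.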

The code is available in this demo repository. Next time we will see how to build our own model!

And that’s it, happy predicting!

Calling a Transformer ML Model directly via SQL to predict sentiments was originally published in Data Querying on Medium, where people are continuing the conversation by highlighting and responding to this story.

Serving a Transformer model converting Text to SQL with Huggingface and MLflow

Romain Rigaux — Sun, 24 Oct 2021 05:23:01 GMT

As machine learning continues to mature, here is an intro on how to use a T5 model to generate SQL queries from text questions and serve it via a REST API.

Update: Follow-up post about using the Model directly in SQL

Machine Learning for code completion got a lot of press with the release of OpenAI Codex which powers GitHub Copilot. Many companies are tackling this problem and making progress is now quicker thanks to the better tooling and techniques.

In the 10 years of evolution of the Hue SQL Editor, investing in and switching to a parser-based autocomplete was one of the top three best decisions. The parsers have even been reused by most of the competitors. This was done five years ago and now new (complementary) approaches are worth investigating.

Starting the MLflow server and calling the model to generate a corresponding SQL query to the text question

Here are three SQL topics that could be simplified via ML:

Text to SQL → a text question gets converted into an SQL query
SQL to Text → getting help on understanding what an SQL query is doing
Table Question Answering → literally ask questions on a grid dataset

Let’s have an intro with the generation of an SQL query from a text question.

For this we pick an existing model named dbernsohn/t5_wikisql_SQL2en.

Most of the difficult work has already been done by building the model and fine tuning it on the WikiSQL dataset.

Invocation of the prediction service REST API via curl

Let’s run the model with a simple question:

> python text2sql.py predict --query="How many people live in the USA?"

"SELECT COUNT Live FROM table WHERE Country = united states AND Name: text"

Bonus: this quick CLI, based on a previous tutorial, makes it easy to interact with the model

Obviously the results are not pixel perfect and a lot more can be done but this is a good start. Now let’s see how serving the model as an API works:

Pulling a trained Text2SQL model M2 from Huggingface Hub and using MLflow to register it as an experiment and serve it via a REST API

curl command asking the model to predict the SQL from a text question

For this we will use MLflow which provides a lot of the glue to automate the tedious engineering management of ML models.

https://medium.com/media/3e50adaac9c053e62ad28b5573d0d167/href https://medium.com/media/ccb77c71a1a5b4ea1286f69e3a4927a2/href https://medium.com/media/b96ad2d22b877c03f19d89bf3bbdc581/href

The API is simply local here but MLflow can automate the pushes and deploys of the models in production environments. In our case we just want to register it:

python text2sql.py train

And after starting the mlflow ui we can see the experiment:

Registering the small size model

Seeing some of the model metadata as well as how to load it. Note that more options like Schemas and registering in the Model Registry are available.

Now we select the iteration we want to serve:

mlflow models serve -m /home/romain/projects/romain/text2sql/mlruns/0/efec45c930714e3581033699e011df51/artifacts/model -p 5001

And then we can query it directly!

curl -X POST -H "Content-Type:application/json; format=pandas-split" --data '{"columns":["text"],"data":[["How many people live in the USA?"]]}' http://127.0.0.1:5001/invocations

"SELECT COUNT Live FROM table WHERE Country = united states AND Name: text"

And that’s it!
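The same call as the curl command above can be prepared from Python with only the standard library. This is a minimal sketch: the helper name `build_prediction_request` is made up here, and the `format=pandas-split` content type simply mirrors the curl example. Newer MLflow releases changed the request format, so check the documentation for the version you run.

```python
import json
import urllib.request


def build_prediction_request(questions, url="http://127.0.0.1:5001/invocations"):
    """Build the POST request sent to the local `mlflow models serve` endpoint."""
    # One row per question, in the pandas "split" orientation used by the curl call.
    payload = {"columns": ["text"], "data": [[question] for question in questions]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


request = build_prediction_request(["How many people live in the USA?"])

# To send it once the model server is running:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```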

The project is in a Github repo. As a follow-up you can also find a detailed example of how to manage a Bayesian Model with MLflow.

In the next episodes we will see how to integrate the ML API into your own SQL Editor and improve the model!


Hosting a Static Website

Romain Rigaux — Wed, 06 Oct 2021 17:51:50 GMT

In 2021, here are some quick and efficient solutions to perform the Hosting.

These days serving a basic website should not take much of your time. Note that what worked for me does not necessarily mean that it is the best for you, and vice versa!

Here is what I tried lately:

Kubernetes and Let’s Encrypt

It is what has been done for gethue.com and all its services like demo.gethue.com, docs.gethue.com, cdn, helm… It is overkill but a great way to understand how services can be operated and also 100% self contained (SSL included, seamless auto upgrade after a change) which is very handy.

gethue.com

Google Cloud Storage

Similar to the AWS S3 public hosting (there are lots of other solutions in AWS too) and it looked easy to try despite a non-intuitive setup. But there is no way to get HTTPS simply or for free for a custom domain name, so I dropped it (but it is good for a CDN).

Netlify

Should be one of the easiest. Indeed, it was very quick to sign up, then just drag & drop the files and transfer the domain. It is famous for integrating with Github. Probably what I would recommend for an open source website.

Firebase

I did not know about it but the Google Storage docs mention it as an alternative for easy support of HTTPS and custom domain name.

Serve dynamic content and host microservices using Firebase Hosting

And it was very easy to use:

firebase login
firebase projects:list
firebase init

And preview/deploy:

firebase emulators:start
firebase deploy

Next step will be to fully automate the push of the updates to the live Website. We will see how the Github Action performs in practice or re-poke at Netlify!


To the Next Generation of Data Querying!

Romain Rigaux — Tue, 05 Oct 2021 04:29:04 GMT

After close to 10 years of evolution on the Hue project at Cloudera (and many team retreats over the world) I am joining Databricks to help make Querying Data Ubiquitous and Simple.

Interestingly, one of the first Spark SQL querying experiences was pioneered in the early days with the Livy API in Hue and promoted in the first Spark Summits.

After all these years, we also got better at shipping a robust core of SQL functionalities and developing software in a much more efficient way via automation, APIs and Components.

Hue SQL + Redash pave the way for a modern Querying.

Query Flow: Smarter Data Querying bridging SQL to ML

On top of this, AI matured to power the next Generation of Editors and Smart Autocompletes (e.g. Github Copilot, Tabnine…), and Data Warehousing can provide easy data access for training ML models and executing inferences via SQL itself.

The direction of Hue is still to be determined with regards to the new role and any feedback is welcomed!

The Hue logo

50% of Hue contributions over the years while growing the Team/Project

Passion is strong, only Experience can beat it, and now Passion + Experience should help deliver the next Level of Query Flow!

In the meantime, check out the current DB SQL!

Happy Querying!

Onwards!


Spark Summit Europe: Building a REST Job Server for interactive Spark as a service

Romain Rigaux — Tue, 28 Sep 2021 17:59:26 GMT

Initially published on https://gethue.com/spark-summit-europe-building-a-rest-job-server-for-interactive-spark-as-a-service/ on 28 October 2015.

Building a REST Job Server for interactive Spark as a service

Livy is a new open source Spark REST Server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user is manipulating them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts per user. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them. Livy also enables the development of Spark Notebook applications. Those are ideal for quickly doing interactive Spark visualizations and collaboration from a Web browser! This talk is technical and details the architecture and design decisions taken for developing this server, as well as its internals. It also describes the alternatives we tried and the challenges that were faced. The capabilities of Livy will then be demoed live in Hue’s Notebook Application through a real life scenario.
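To make the interaction model concrete, here is a small sketch of the two REST calls behind an interactive shell: creating a session, then submitting a statement to it. The `/sessions` and `/sessions/{id}/statements` endpoints follow Apache Livy's documented REST API as it exists today, which may differ from the 2015 version presented in the talk. The base URL and helper names are assumptions for illustration.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumption: a local Livy on its default port


def _post(url, body):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """Ask Livy to start a remote Spark shell of the given kind."""
    return _post(f"{LIVY_URL}/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """Submit a snippet of code to an existing session."""
    return _post(f"{LIVY_URL}/sessions/{session_id}/statements", {"code": code})


session_request = create_session_request()
statement_request = run_statement_request(0, "1 + 1")
# Each request can be sent with urllib.request.urlopen(...) against a running server.
```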

Examples:

Spark Summit Europe: Building a REST Job Server for interactive Spark as a service from gethue



Hadoop / Spark Notebook and Livy REST Job Server improvements!

Romain Rigaux — Tue, 28 Sep 2021 17:57:25 GMT

Initially published on https://gethue.com/spark-notebook-and-livy-rest-job-server-improvements/ on 24 August 2015.

The Notebook application as well as the REST Spark Job Server are being revamped. The goal of these two components is to let users execute Spark in their browser or from anywhere. They are still in beta but the next version of Hue will have them graduate. Here is a list of the improvements and a video demo:

Revamp of the snippets of the Notebook UI
Support for Spark 1.3, 1.4, 1.5
Impersonation with YARN
Support for R shell
Support for submitting jars or python apps

How to play with it?

See in this post how to use the Notebook UI and on this page on how to use the REST Spark Job Server named Livy. The architecture of Livy was recently detailed in a presentation at Big Data Scala by the Bay. Next updates will be at the Spark meetup before Strata NYC and Spark Summit in Amsterdam.

Slicker snippets interface

The snippets now have a new code editor, autocomplete and syntax highlighting. Shortcut links to HDFS paths and Hive tables have been added.

R support

The SparkR shell is now available, and plots can be displayed inline

Support for closing session and specifying Spark properties

All the spark-submit, spark-shell, pyspark, sparkR properties of jobs & shells can be added to the sessions of a Notebook. This will for example let you add files, modules and tweak the memory and number of executors.
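As an illustration of what such session properties look like at the REST level, the sketch below builds the body of a session creation request. The field names (`executorMemory`, `numExecutors`, `files`, `pyFiles`, `conf`) come from Apache Livy's current session API as far as I know it; how the Hue Notebook maps its own settings onto them is not shown in this post, so treat the mapping as an assumption.

```python
def session_payload(kind="pyspark", executor_memory=None, num_executors=None,
                    files=(), py_files=(), conf=None):
    """Build the JSON body for creating a Livy session with Spark properties."""
    payload = {"kind": kind}
    if executor_memory:
        payload["executorMemory"] = executor_memory   # e.g. "2g"
    if num_executors:
        payload["numExecutors"] = num_executors
    if files:
        payload["files"] = list(files)                # extra files shipped to the job
    if py_files:
        payload["pyFiles"] = list(py_files)           # extra Python modules
    if conf:
        payload["conf"] = dict(conf)                  # raw Spark configuration keys
    return payload


# Hypothetical values, for illustration only.
payload = session_payload(
    executor_memory="2g",
    num_executors=4,
    py_files=["hdfs:///libs/utils.py"],
)
```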

So give this new Spark integration a try and feel free to send feedback on the hue-user list or @gethue!


Performing automated upgrades of Services after a code change

Romain Rigaux — Fri, 24 Sep 2021 19:23:01 GMT

Efficient CICD by leveraging GitHub, DockerHub, Keel and webhooks.

This is a series of posts describing how the Hue Query Service is being built.

Following up on the concept of “no downtime while upgrading”, scheduled daily refreshes are a good first step, but shortening the development-release feedback loop even more can provide an even better return on investment, i.e.:

Did we introduce an evident bug in the latest code change?
Is the new functionality available right away to test/use for real?

This is obviously possible only if the building and testing of the artifact is fully automated and quick to build.

Note: one of the goals is to avoid custom scripting as much as possible and stay simple

After a Pull Request is merged, a new container is automatically built and will replace the currently running ones in our Kubernetes cluster with a Keel deployment:

From sending a code change to building the artifact and serving it

Docker Hub auto building feature

Note: the Docker Hub auto build feature is now a paid feature. Some other providers like Google Cloud still offer it for free

Our API pod freshly re-created with the new image

Caveat: auto rolling upgrades with versioning (instead of “latest” tag) are the way to go for a safe rollout in case of shipping a critical container

Many more options are described on https://keel.sh/.

What if we want something even lighter than above? We will look at some Serverless options in another episode!


Quickly Building a Command Line Interface for your Web Service

Romain Rigaux — Fri, 10 Sep 2021 22:08:33 GMT

Make your service more accessible and force good design principles.

A Command Line Interface (CLI) is the antithesis of a modern Web interface. The Hue Query Assistant already provides a visual way to Query Data and manipulate files, and is getting simpler and smarter with each release.

So why provide a CLI?

In short, we found a CLI was:

Helpful for simplifying certain operations, very quick to pay back, and good at reinforcing clean designs and opening up more creativity.

Philosophy

The CLI targets more advanced users and provides direct access to the Query Service from their desktop or favorite machine as long as it can talk to it via HTTP.

Still the same interaction as via a Web browser but via a Bash terminal

The first goal is to augment the current API for its most important operations:

Execute an SQL statement or saved query
List, download, upload files

It was decided to focus on the new secure Storage API (handy to manipulate files from a shell) and not blindly support all the possible operations (skip the clutter) of the recent public REST API powering the SQL Scratchpad and the File Browsers.

The CLI leverages a lot of existing pieces and it only took 2 days to design/implement a first version with one operation. It is also straightforward for the Open Source Community and Hue Contributors to add extra operations by following the existing commands.

Last but not least, there was a lot of learning and inspiration cascading down from this first version, in particular on how to design with and use Typer instead of the traditional argparse:

Alternatives, Inspiration and Comparisons

Typer provides exemplary documentation, is designed for simplicity, is built on top of Click, and leverages Python 3 types.

But now let’s give this CLI a quick try!

The CLI project is part of the Compose repository and automatically bundled into the Gethue package.

Let’s install the latest version:

pip install gethue

See the current commands:

> compose --help

Usage: compose [OPTIONS] COMMAND [ARGS]...

Query your Data Easily

Options:
--install-completion  Install completion for the current shell.
--show-completion     Show completion for the current shell, to copy it or customize the installation.
--help                Show this message and exit.

Commands:
auth     Configure the CLI
query    Execute queries, list databases, tables
storage  Manipulate data files

And point to the demo service API:

> compose auth

Api url [https://demo.gethue.com]:
Username [demo]:
Password [demo]:
Auth: success 200
Token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjMxMjE5MDkxLCJqdGkiOiJkNGJkY2Q5M2NjMjg0MDlkYWJlYWZhNGRlNjlkOTMzMyIsInVzZXJfaWQiOjJ9.Gr8bW_JaZ8yzQ3eEZYp3jKbdsSgLAXxqvSRbeU6jhLg

And list the content of a remote directory:

> compose storage list --path s3a://demo-gethue

s3a://demo-gethue/data (https://demo.gethue.com/hue/filebrowser/view=s3a%3A%2F%2Fdemo-gethue%2Fdata)

s3a://demo-gethue/data/web_logs (https://demo.gethue.com/hue/filebrowser/view=s3a%3A%2F%2Fdemo-gethue%2Fdata%2Fweb_log

Et voila!

The new CLI is paving the way for the Hue 5 Query Editor Service.

We also got new ideas along the way, like decoupling even more the Python modules, introducing design patterns from the Typer project, getting familiar with Python 3 typing… which already paid back the time spent on creating the CLI.

We bet that the user community will also come back with new usage feedback! (hint: like scheduling queries ;)


Seamless integration of a SQL Scratchpad component into your own Web app

Romain Rigaux — Tue, 24 Aug 2021 17:13:58 GMT

How to authenticate an external Web Component with your application.

Now that we have these decoupled and reusable SQL Web Components and REST API, how do we link them up together into a separate Web application?

This post describes strategies for having them interact properly, in particular about how to handle the authentication.

High level: authentication between Web Component and API and the Database to query

The SQL Scratchpad component is injected into a Web page. This Web page can either be served by Hue or be completely independent, e.g. we want to leverage the advanced SQL autocomplete and query execution of Hue from within another completely independent application (an existing Notebook app or a custom Popup functionality).

Same Authentication as the Hue API

i.e. This is the “traditional” authentication, same as signing in from the main Hue login page itself.

Hue authentication supports multiple auth backends (with some of them providing out of the box SSO like LDAP or SAML). When using the Web interface, the browser currently forwards an HTTP cookie to the API. When using the public API, it is a JWT token.

This is pretty straightforward and brings us to the second strategy, where Hue is purely seen as an external SQL Editor service.

Authentication external to the Hue API

In the real world, the Web page displaying the SQL Scratchpad has already authenticated itself via another Authentication service (e.g. company SSO) and got a cookie or JWT identifying the logged-in user. Also we don’t want yet another login box showing-up in the component asking the user to authenticate.

Similarly to providing custom authentication login backends, Hue also supports providing your own authentication for the public REST API itself (thanks to the Django REST Framework's own pluggability).

For example a single page Web app can provide its own token to the SQL Editor component:

The Web page already authenticated with the Central Authentication service for a token, forward this token to the Scratchpad Component that will forward it to the Query Service API which can decode it, usually by leveraging a public API key

There are multiple ways to pick up and provide this JWT token. It depends on how it is stored in the main application, which could be:

hardcoded
a cookie
in storage
in memory

In demo.gethue.com, the authentication is well, realistic only for a demo as the credentials are set in clear in the page:

One more realistic way is to provide the token to the Component via its setBearerToken() method (other hooks are currently in design).

Note: we are not discussing here the possible CSRF/XSS vulnerabilities of above methods as these are not specific to the Web component, but this is something to be aware of.

Some advantages of this method are that the Query Service can be seen as its own “headless” entity and that the interactions get simpler: if the token can be validated and so trusted, it can also be forwarded between services, e.g. the Query API can forward the end user token to the Database engine.
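To illustrate what "validated and so trusted" means in practice, here is a minimal, standard-library-only sketch of issuing and checking an HS256-signed JWT, the algorithm named in the header of the demo tokens. It is not Hue's implementation: the function names and the shared secret are made up, and a real deployment would more likely rely on a maintained library such as PyJWT, often with asymmetric keys so that services can verify tokens without being able to issue them.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    # JWT segments drop the base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input, secret):
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(claims, secret):
    """Create a compact HS256 JWT for the given claims."""
    header = {"typ": "JWT", "alg": "HS256"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def validate_token(token, secret, now=None):
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        expected = _signature(f"{header_b64}.{claims_b64}", secret)
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except ValueError:
        # Malformed token: wrong number of segments, bad base64 or bad JSON.
        return None
    current = time.time() if now is None else now
    if "exp" in claims and claims["exp"] < current:
        return None
    return claims
```

A service that receives a token it can validate this way can pass the very same string along in the `Authorization` header of its own downstream calls, which is the forwarding described above.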

The same token is used across the platform services

Interested in helping build better SQL components (Editor, Parsers, Formatter, APIs..)? Feel free to follow-up on the development section!


Object/File Storage public REST API

Romain Rigaux — Fri, 13 Aug 2021 17:17:21 GMT

Leverage a REST API to simplify your data files interactions like list, upload, download in the public object storage Clouds.

Same file operations as in the Web App available as REST API calls

This post comes with a live tutorial of the Hue file listing API via the demo environment demo.gethue.com.

Background: the Hue SQL Editor project has been evolving for more than 10 years and allows you to query any Database or Data Warehouse.

Recently: like previously described in the SQL Editor API post, all the end user functionalities and under the cover grunt work of integration can now be simply reused programmatically (freeing up time to let you focus on the data work itself instead).

The main use cases for the File API are to upload data and create an SQL Table on top of it, or to retrieve those pesky file URIs:

Quick Path copy or open file in the Create Table Wizard

The API leverages the standard credentials of your users (SSO via LDAP, SAML…) and is the same as if they were interacting via the Web UI directly. As a bonus, it is cloud agnostic, so nobody is required to learn the intricacies of each provider and everybody can simply use an interface they are already familiar with.

API Demo

The simplest operation is to list the content of your buckets or directories (also known as “list dir”).

Start by authenticating and asking for an API access token (also known as JWT):

curl -X POST https://demo.gethue.com/api/token/auth -d 'username=demo&password=demo'

{"refresh":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoicmVmcmVzaCIsImV4cCI6MTYyOTQ3MTE0MiwianRpIjoiYjNkMDUzN2I1OGU5NDNlZGE0OTJiYzVmOTkzMDEwOTEiLCJ1c2VyX2lkIjoyfQ._MXo09PzisvqY7-1NMVIaLiUCVksYx2ZA5v_PWTk0TY","access":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjI4OTUyNzQyLCJqdGkiOiJkYTEzZjI2OWY2N2M0MTNiODNiNGYwNzY1ZDA3NzdmMCIsInVzZXJfaWQiOjJ9.47gnDdIwVSo_cULXU856WUgW8FW7UHXMg7FH-dDpoRc"}
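The returned tokens are regular JWTs, so their claims can be inspected locally, for instance to see when the access token expires and a new one is needed. The sketch below only decodes the middle segment and does not verify the signature, so it is for inspection and not for trust decisions. The helper name `jwt_claims` is made up for this example.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_claims(token):
    """Decode the claims of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    # JWT segments drop the base64 padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# The "access" value from the demo response above.
access = (
    "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9."
    "eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjI4OTUyNzQyLCJqdGkiOiJkYTEzZjI2OWY2N2M0MTNiODNiNGYwNzY1ZDA3NzdmMCIsInVzZXJfaWQiOjJ9."
    "47gnDdIwVSo_cULXU856WUgW8FW7UHXMg7FH-dDpoRc"
)

claims = jwt_claims(access)
expires_at = datetime.fromtimestamp(claims["exp"], tz=timezone.utc)
print(claims["token_type"], "token, expires at", expires_at.isoformat())
```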

Then provide this access value in each of the following calls. In your case, update the examples below with your own:

Authorization: Bearer <your access token>

Here is how to list the content of a path, here the S3 bucket s3a://demo-gethue:

curl -X GET https://demo.gethue.com/api/storage/view=s3a://demo-gethue -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjI4OTUyNzQyLCJqdGkiOiJkYTEzZjI2OWY2N2M0MTNiODNiNGYwNzY1ZDA3NzdmMCIsInVzZXJfaWQiOjJ9.47gnDdIwVSo_cULXU856WUgW8FW7UHXMg7FH-dDpoRc"

{
  "path": "s3a://demo-gethue",
  "breadcrumbs": [
    {
      "url": "s3a%3A%2F%2F",
      "label": "s3a://"
    },
    {
      "url": "s3a%3A%2F%2Fdemo-gethue",
      "label": "demo-gethue"
    }
  ],
  "current_request_path": "/filebrowser/view=s3a%3A%2F%2Fdemo-gethue",
  "is_trash_enabled": false,
  "files": [
    {
      "path": "s3a://",
      "name": "..",
      "stats": {
        "path": "s3a://",
        "size": 0,
        "atime": null,
        "mtime": null,
        "mode": 16895,
        "user": "",
        "group": "",
        "aclBit": false
      },
      "mtime": "",
      "humansize": "0 bytes",
      "type": "dir",
      "rwx": "drwxrwxrwx",
      "mode": "40777",
      "url": "/filebrowser/view=s3a%3A%2F%2F",
      "is_sentry_managed": false
    },
    {
      "path": "s3a://demo-gethue",
      "name": ".",
      "stats": {
        "path": "s3a://demo-gethue",
        "size": 0,
        "atime": 1628866612,
        "mtime": 1628866612,
        "mode": 16895,
        "user": "",
        "group": "",
        "aclBit": false
      },
      "mtime": "August 13, 2021 02:56 PM",
      "humansize": "0 bytes",
      "type": "dir",
      "rwx": "drwxrwxrwx",
      "mode": "40777",
      "url": "/filebrowser/view=s3a%3A%2F%2Fdemo-gethue",
      "is_sentry_managed": false
    },
    {
      "path": "s3a://demo-gethue/data",
      "name": "data",
      "stats": {
        "path": "s3a://demo-gethue/data/",
        "size": 0,
        "atime": null,
        "mtime": null,
        "mode": 16895,
        "user": "",
        "group": "",
        "aclBit": false
      },
      "mtime": "",
      "humansize": "0 bytes",
      "type": "dir",
      "rwx": "drwxrwxrwx",
      "mode": "40777",
      "url": "/filebrowser/view=s3a%3A%2F%2Fdemo-gethue%2Fdata",
      "is_sentry_managed": false
    }
  ],
  "page": {
    "number": 1,
    "num_pages": 1,
    "previous_page_number": 0,
    "next_page_number": 0,
    "start_index": 1,
    "end_index": 1,
    "total_count": 1
  },
  "pagesize": 30,
  "home_directory": null,
  "descending": null,
  "cwd_set": true,
  "file_filter": "any",
  "current_dir_path": "s3a://demo-gethue",
  "is_fs_superuser": false,
  "groups": [],
  "users": [],
  "superuser": null,
  "supergroup": null,
  "is_sentry_managed": false,
  "apps": [
    "filebrowser",
    "metastore",
    "useradmin",
    "indexer",
    "notebook"
  ],
  "show_download_button": true,
  "show_upload_button": true,
  "is_embeddable": false,
  "s3_listing_not_allowed": ""
}

Some of the parameters:

pagesize=45 (number of items to return)
pagenum=1 (pagination)
filter=file names text to match, can be empty
sortby=name (field to use for sorting)
descending=false (keep sorting alphabetical)

e.g. pagesize=45&pagenum=1&filter=&sortby=name&descending=false
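For scripting, the listing URL and its parameters can be assembled with the standard library, and the entries pulled out of the JSON response. This is a sketch under two assumptions: that the parameters are appended as a regular query string after `?`, and that the `.` and `..` entries seen in the response above are not wanted. The helper names are made up.

```python
from urllib.parse import urlencode

API = "https://demo.gethue.com/api/storage"


def listing_url(path, **params):
    """Build the 'list dir' URL for a path, with optional query parameters."""
    url = f"{API}/view={path}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def entry_names(listing):
    """Return the names in a listing response, skipping '.' and '..'."""
    return [
        entry["name"]
        for entry in listing.get("files", [])
        if entry["name"] not in (".", "..")
    ]


url = listing_url(
    "s3a://demo-gethue",
    pagesize=45, pagenum=1, filter="", sortby="name", descending="false",
)
print(url)
```

The resulting URL is then requested with the same `Authorization: Bearer` header as in the curl call.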

Then peek at the data of the s3a://demo-gethue/data/web_logs/index_data.csv file:

curl -X GET https://demo.gethue.com/api/storage/view=s3a://demo-gethue/data/web_logs/index_data.csv -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjI4OTUyNzQyLCJqdGkiOiJkYTEzZjI2OWY2N2M0MTNiODNiNGYwNzY1ZDA3NzdmMCIsInVzZXJfaWQiOjJ9.47gnDdIwVSo_cULXU856WUgW8FW7UHXMg7FH-dDpoRc"

{
  "show_download_button": true,
  "is_embeddable": false,
  "editable": false,
  "mtime": "October 31, 2016 03:34 PM",
  "rwx": "-rw-rw-rw-",
  "path": "s3a://demo-gethue/data/web_logs/index_data.csv",
  "stats": {
  "size": 6199593,
  "aclBit": false,
  ...............
  "contents": "code,protocol,request,app,user_agent_major,region_code,country_code,id,city,subapp,latitude,method,client_ip,  user_agent_family,bytes,referer,country_name,extension,url,os_major,longitude,device_family,record,user_agent,time,os_family,country_code3
    200,HTTP/1.1,GET /metastore/table/default/sample_07 HTTP/1.1,metastore,,00,SG,8836e6ce-9a21-449f-a372-9e57641389b3,Singapore,table,1.2931000000000097,GET,128.199.234.236,Other,1041,-,Singapore,,/metastore/table/default/sample_07,,103.85579999999999,Other,"demo.gethue.com:80 128.199.234.236 - - [04/May/2014:06:35:49 +0000] ""GET /metastore/table/default/sample_07 HTTP/1.1"" 200 1041 ""-"" ""Mozilla/5.0 (compatible; phpservermon/3.0.1; +http://www.phpservermonitor.org)""
    ",Mozilla/5.0 (compatible; phpservermon/3.0.1; +http://www.phpservermonitor.org),2014-05-04T06:35:49Z,Other,SGP
    200,HTTP/1.1,GET /metastore/table/default/sample_07 HTTP/1.1,metastore,,00,SG,6ddf6e38-7b83-423c-8873-39842dca2dbb,Singapore,table,1.2931000000000097,GET,128.199.234.236,Other,1041,-,Singapore,,/metastore/table/default/sample_07,,103.85579999999999,Other,"demo.gethue.com:80 128.199.234.236 - - [04/May/2014:06:35:50 +0000] ""GET /metastore/table/default/sample_07 HTTP/1.1"" 200 1041 ""-"" ""Mozilla/5.0 (compatible; phpservermon/3.0.1; +http://www.phpservermonitor.org)""
    ",Mozilla/5.0 (compatible; phpservermon/3.0.1; +http://www.phpservermonitor.org),2014-05-04T06:35:50Z,Other,SGP
  ...............
}

Some of the parameters:

offset=0
length=204800
compression=none
mode=text

e.g. ?offset=0&length=204800&compression=none&mode=text
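Assuming `offset` and `length` are byte counts and that the API honors them for any window of the file (an assumption, not something this post states), a large file could be previewed piece by piece. The sketch below only computes the windows, here for the `size` of 6199593 bytes reported above and the default `length` of 204800.

```python
def preview_windows(size, length=204800):
    """Yield (offset, length) pairs that cover a file of `size` bytes."""
    for offset in range(0, size, length):
        # The last window is shortened so it stops at the end of the file.
        yield offset, min(length, size - offset)


windows = list(preview_windows(6199593))
print(len(windows), "requests, last one:", windows[-1])
```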

And then decide to download it:

curl -X GET https://demo.gethue.com/api/storage/download=s3a://demo-gethue/data/web_logs/index_data.csv -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZXhwIjoxNjI4OTUyNzQyLCJqdGkiOiJkYTEzZjI2OWY2N2M0MTNiODNiNGYwNzY1ZDA3NzdmMCIsInVzZXJfaWQiOjJ9.47gnDdIwVSo_cULXU856WUgW8FW7UHXMg7FH-dDpoRc"

It is also possible to upload your data directly (if you have the proper write permissions in the remote destination folder).

Here we send the local file README.md to the remote s3a://demo-gethue/web_log_data/ directory:

curl -X POST https://demo.gethue.com/api/storage/upload/file?dest=s3a://demo-gethue/web_log_data/ --form hdfs_file=@README.md

Note: the hdfs_file parameter is a relative or absolute path to a local file. The name is confusing currently, it should be read more like local_file (i.e. not related to HDFS only)

Then what?

When the data is stored in the cloud, it becomes easy to create a SQL table and query it. One way is to open up the File Browser and copy the path of the data into a CREATE TABLE statement, or just go via the Create table wizard which will do all the work for you.

Note that small data files don’t even need to go via the cloud storage and can be directly uploaded via drag & drop in the Web interface or Importer API. Something that will be demoed next time, so stay tuned!

Directly uploading a file and getting a SQL table ready to query

Proper security

The timing is also good. The file listing (for HDFS, the Hadoop file system) has been present since day one. Later on, AWS S3, Azure Storage and Google Cloud Storage (beta) were added but lacked fine-grained security (i.e. all the users were sharing the same credentials, which is not good).

This is not true anymore as recently the shared signed URL technology of these cloud storages is being leveraged under the hood to have each user perform file operations under their own distinct credentials. This allows true self service instead of restricting data uploads to only admin. Users can be trusted and upload their own files and analyze them without contacting anybody else. Another bottleneck removed!

If interested in more technical details, read more about AWS Shared Signature or Azure Signed URLs.

Hue or Compose app contacting a middleware service that converts raw calls to object storages into custom signed URLs in order to provide fine grained authorization

Sum-up

Now there are no excuses not to be data driven and provide self service analytics to your hungry users ;)

Using GCP or other storages? Let us know!

And in case you missed it, the coolest API is actually the Execute a SQL query, play with it!

Onwards!

Romain
