
Getting started with DPL data

This guide explains how to access Parse.ly Data Pipeline data, including the publicly available demo data and customer-specific Data Pipeline data.


Download Data Pipeline data using AWS CLI

Set up the AWS CLI locally by following the AWS CLI installation instructions.

Set up credentials to access a private Data Pipeline S3 bucket. This step is not necessary for the public demo-data S3 bucket.

aws configure --profile parsely_dpl
AWS Access Key ID [None]: ENTER ACCESS ID
AWS Secret Access Key [None]: ENTER SECRET KEY
Default region name [None]: us-east-1
Default output format [None]: json

Download one file

Download files from a Parse.ly Data Pipeline S3 bucket or from the public Parse.ly S3 bucket with demo Data Pipeline data.

For the demo Parse.ly Data Pipeline data:

aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo/events/file_name.gz .

For a private customer-specific S3 bucket, make sure to use the --profile flag:

aws s3 cp s3://parsely-dw-bucket-name-here/events/file_name.gz . --profile parsely_dpl

Download all files

Download all files in an S3 bucket using the commands below. This may involve a large amount of data.

For the demo Parse.ly Data Pipeline data:

aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo . --recursive

For a private customer-specific S3 bucket, make sure to use the --profile flag:

aws s3 cp s3://parsely-dw-bucket-name-here . --recursive --profile parsely_dpl

Copy the data to an S3 bucket

The AWS CLI can also copy the data directly from the Parse.ly S3 bucket to an S3 bucket of your own, without downloading it locally first.

For the demo Parse.ly Data Pipeline data:

aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo s3://your-bucket-here --recursive

For a private customer-specific S3 bucket, make sure to use the --profile flag:

aws s3 cp s3://parsely-dw-bucket-name-here s3://your-bucket-here --recursive --profile parsely_dpl

Copy Data Pipeline data to Redshift or Google BigQuery

Parse.ly provides the parsely_raw_data GitHub repository for these use cases.

The README.md in that repository contains detailed installation and usage instructions. The following examples demonstrate common tasks after installing parsely_raw_data.

Copy S3 data to a Redshift database

This command creates an Amazon Redshift table using the specified Parse.ly schema and loads the Data Pipeline data into the new table.

python -m parsely_raw_data.redshift

Copy S3 data to Google BigQuery

This command creates a Google BigQuery table using the specified Parse.ly schema and loads the Data Pipeline data into the new table.

python -m parsely_raw_data.bigquery

Query Data Pipeline data using AWS Athena

AWS Athena provides a SQL interface to query S3 files directly without moving data.

  1. Create an Athena table using the Parse.ly Data Pipeline Athena schema
  2. Load the data into the recommended year-month partitions (a multi-partition sketch follows this list):
    ALTER TABLE table_name_here ADD PARTITION (year='YYYY', month='MM') location 's3://parsely-dw-bucket-name-here/events/YYYY/MM' 
  3. Use Athena to query the Data Pipeline data.
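If the bucket contains several months of data, multiple partitions can be registered in one statement. This is a minimal sketch assuming the table created in step 1 is named table_name_here and the events live under the usual events/YYYY/MM prefix; the dates shown are illustrative placeholders:

ALTER TABLE table_name_here ADD IF NOT EXISTS
  PARTITION (year='2024', month='01') location 's3://parsely-dw-bucket-name-here/events/2024/01'
  PARTITION (year='2024', month='02') location 's3://parsely-dw-bucket-name-here/events/2024/02'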

Getting started queries to answer common questions

These queries are formatted for use with Athena to query the Data Pipeline data.

Retrieve all records

This query retrieves all records from the Athena table that reads from the S3 files. Only partitions that have been loaded (see the section above) are returned, and filtering on more specific partitions reduces Athena query costs; a restricted sketch follows the query.

select * from parsely_data_pipeline_table_name
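To reduce the amount of data Athena scans, the same query can be restricted to a single partition and capped with a limit. A minimal sketch, using the same placeholder table name and partition values as the other templates:

select *
from parsely_data_pipeline_table_name
where
  year = 'yyyy' and --partition filter: this makes the query cheaper!
  month = 'mm' --partition filter: this makes the query cheaper!
limit 100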

Bot traffic investigation

Bot traffic continues to evolve. Investigate the user agent and IP address for a specific post on a certain day using the following query as a template.

select
  user_agent,
  visitor_ip,
  count(action) as pageviews
from parsely_data_pipeline_table_name
where
  year = 'yyyy' and --this makes the query cheaper!
  month = 'mm' and --this makes the query cheaper!
  action = 'pageview' and
  url like '%only-include-unique-url-path-here%' and
  date(ts_action) = 'yyyy-mm-dd'
group by 1,2
order by 3 desc

Engaged-time by referrer type

This is a template query to retrieve engaged time by referrer category.

select
  channel,
  ref_category,
  sum(engaged_time_inc) as engaged_time_seconds,
  sum(engaged_time_inc)/60.0 as engaged_time_minutes
from parsely_data_pipeline_table_name
where
  year = 'yyyy' and
  month = 'mm'
group by 1,2
order by 3 desc

View conversions

Conversion events are included in the Data Pipeline data. Query them using the following template; a per-day aggregation sketch follows it.

select
  *
from parsely_data_pipeline_table_name
where
  year = 'yyyy' and --this makes the query cheaper!
  month = 'mm' and --this makes the query cheaper!
  action = 'conversion'
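As a follow-up, conversions can be counted per day within the chosen partition. This minimal sketch reuses only fields already shown in the templates above (action, ts_action, and the partition columns) and assumes ts_action can be cast to a date, as in the bot-traffic template:

select
  date(ts_action) as conversion_date,
  count(*) as conversions
from parsely_data_pipeline_table_name
where
  year = 'yyyy' and --this makes the query cheaper!
  month = 'mm' and --this makes the query cheaper!
  action = 'conversion'
group by 1
order by 1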

Use dbt and a pre-formatted star schema to organize Data Pipeline data in Redshift

Parse.ly's dbt (data build tool) project automates SQL table creation and data-pipeline management for Parse.ly data. It generates queryable tables for page views, sessions, loyalty users, subscribers, engagement levels, and read time, and it handles incremental loading of new data from S3 into SQL tables, reducing configuration time and enabling faster custom query development.

More information is available in the Parse.ly dbt Redshift repository.

How to get started

  • Install dbt and requirements from the main /dbt/ folder one level up: pip install -r requirements.txt
  • Edit the following files:
    • ~/.dbt/profiles.yml: Enter the profile, Redshift cluster, and database information. Refer to the dbt profile configuration documentation.
    • settings/default.py: This file holds every parameter that needs to be configured.
  • Test the configuration by running python -m redshift_etl. A fully updated settings/default.py file requires no additional parameters. Arguments provided at runtime override settings in default.py.
  • Run redshift_etl.py on an automated schedule; daily runs are recommended.

Schemas/models

  • Users Table Grain: One row per unique user ID based on IP address and cookie. This table provides Parse.ly Data Pipeline lifetime engagement data for each user, including loyalty and rolling 30-day loyalty classification.
  • Sessions Table Grain: One row per user session. A session is any activity by one user without an idle gap of more than 30 minutes. The session table includes total engagement and page-view metrics for the entire session, as well as the user types at the time of the session, which simplifies identifying conversions into loyalty users and subscribers (see the join sketch after this list).
  • Content Table Grain: One row per article or video. This table contains only the most recent metadata for each article or video and enables simplified reporting and aggregation when metadata changes throughout the article’s lifetime.
  • Campaigns Table Grain: One row per campaign. This table contains only the most recent description for each campaign.
  • Pageviews Table Grain: One row per page view. This table contains the referrer, campaign, timestamps, engaged time, and at-time-of-engagement metadata for each page view. The page views are organized to show the order and flow of page views within a session for a single user.
  • Videoviews Table Grain: One row per videoview. This table contains the referrer, campaign, timestamps, engaged time, and at-time-of-engagement metadata for each video view. The video views are organized to show the order and flow of video views within a session for a single user.
  • Custom events Table Grain: One row per custom event sent through the Parse.ly Data Pipeline. This is any event that is not: pageview, heartbeat, videostart, or vheartbeat. These can be specified in the dbt_project.yml file and contain keys to join to users, sessions, content, and campaigns.
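To illustrate how these grains join together, here is a hedged sketch that counts sessions per user by joining the sessions table to the users table. The table and column names (sessions, users, user_id) are hypothetical placeholders, not the actual names generated by the dbt project; check the built models for the real names.

select
  u.user_id, --hypothetical column name
  count(*) as total_sessions
from sessions s --hypothetical table name
join users u --hypothetical table name
  on s.user_id = u.user_id
group by 1
order by 2 desc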

Last updated: December 24, 2025