Skip to main content
r/datascience icon

r/datascience doge doge

Wiki has been Updated!

members
online

What would you do with this task, and how long would it take you to do it? What would you do with this task, and how long would it take you to do it?
Analysis

I'm going to describe a situation as specifically as I can. I am curious what people would do in this situation, I worry that I complicate things for myself. I'm describing the whole task as it was described to me and then as I discovered it.

Ultimately, I'm here to ask you, what do you do, and how long does it take you to do it?

I started a new role this month, I am new to advertising modeling methods like mmm, so I am reading a lot about how to apply the methods specific to mmm in R and python, I use VScode, I don't have a github copilot license, I get to use copilot through windows office license. Although this task did not involve modeling, I do want to ask about that kind of task another day if this goes over well.

The task

5, excel sheets are to be provided. You are told that this is a clients data that was given to another party for some other analysis and augmentation. This is a quality assurance task. The previous process was as follows;

the data
  • the data structure: 1 workbook per industry for 5 industries

  • 4 workbooks had 1 tab, 1 workbook had 3 tabs

  • each tab had a table that had a date column in days, 2 categorical columns advertising_partner, line_of_business and at least 2 numeric columns per work book.

  • some times data is updated from our side and the partner has to redownload the data and reprocess and share again

the process
  • this is done once per client, per quarter (but it's just this client for now)

  • open each workbook

  • navigate to each tab

  • the data is in a "controllable" table

bing bing
home home
impressions spend partner dropdown line of business dropdown
  • where bing and home are controlled with drop down toggles, with a combination of 3-4 categories each.

  • compare with data that is to be downloaded from a tableau dashboard

  • end state: the comparison of the metrics in tableau to the excel tables to ensure that "the numbers are the same"

  • the categories presented map 1 to 1 with the data you have downloaded from tableau

  • aggregate the data in a pivot table, select the matching categories, make sure the values match

additional info about the file

  • the summary table is a complicated sumproduct look up table against an extremely wide table hidden to the left. the summary table can start as early as AK and as late as FE.

  • there are 2 broadly different formats of underlying data in the 5 notebooks, with small structure differences between the group of 3.

in the group of 3
  • the structure of this wide table is similar to the summary table with categories in the column headers describing the metric below it. but with additional categories like region, which is the same value for every column header. 1 of these tables has 1 more header category than the other 2

  • the left most columns have 1 category each, there are 3 date columns for day, quarter.

REGION USA USA USA
PARTNER bing bing google
LOB home home auto
impressions spend ...etc
date quarter impressions spend ...etc
2023-01-01 q1 1 2 ...etc
2023-01-02 q1 3 4 ...etc
in the group of 2
  • the left most categories are actually the categorical headers in the group of 3, and the metrics, the values in each category mach

  • the dates are now the headers of this very wide table

  • the header labels are separated from the start of the values by 1 column

  • there is an empty row immediately below the final row for column headers.

date Label 2023-01-01 2023-01-02
year 2023 2023
quarter q1 q1
blank row
REGION PARTNER LOB measure
blank row
US bing home impressions 1 3
US bing home spend 2 4
US google auto ...etc ...etc ... etc

The question is, what do you do, and how long does it take you to do it?

I am being honest here, I wrote out this explaination basically in the order in which I was introduced to the information and how I discovered it. (Oh it's easy if it's all the same format even if it's weird, oh there are 2-ish different formatted files)

the meeting of this task ended at 11:00AM. I saw this copy paste manual etl project and I simply didn't want to do it. So I outlined my task by identifying the elements of the table, column name ranges, value ranges, stacked / pivoted column ranges, etc... for an R script to extract that data. by passing the ranges of that content to an argument make_clean_table(left_columns="B4:E4", header_dims=c(..etc)) and functions that extract that convert that excel range into the correct position in the table to extract that element. Then the data was transformed to create a tidy long table.

the function gets passed once per notebook extracting the data from each worksheet, building a single table with the columns for the workbook industry, the category in the tab, partner, line of business, spend, impressions, etc...

IMO; ideally (if I have to check their data in excel that is), I'd like the partner to redo their report so that I received a workbook with the underlying data in a traditionally tabular form and their reporting page to use power query and table references and not cell ranges and formula.


Advance your skills in artificial intelligence with training from Google. Learn to think critically and use AI to improve operational efficiency and security.
Image Advance your skills in artificial intelligence with training from Google. Learn to think critically and use AI to improve operational efficiency and security.