From the course: Data Pipeline Automation with GitHub Actions Using R and Python

API limitations with R

- [Instructor] In the previous video, we saw how to send GET requests to pull data from the API using the eia_get function. In this video, we will review the output and explore some of the limitations of the GET request. Let's go back to the data frame we pulled in the previous lesson, df1. Recall that the data frame has 5,000 rows, or observations, and seven variables. We also added the index variable as the eighth variable when we reformatted the period variable into a POSIXct object, or time object.

Let's plot the data using the plot_ly function from the plotly library. We pass df1 as the data we want to plot, and we set x as index and y as value. Since we want a line chart, let's set the type to scatter and the mode to lines. Let's go ahead and run it. As you can notice in the time series plot, there are some weird lines that do not fit the series pattern. We can go ahead and zoom in and explore those points. For example, you can see those lines over here. You can also see them over here and in some other places. So let's zoom in over here. You can see that some of the observations are missing. For example, between November 4th, 2018 and September 13th, there are no points. Similarly, over here, from November 5th, 2018 until January 12th, there are no observations. If you keep exploring the denser areas, you can see similar patterns. For example, if we open it over here, you can see there are some buckets with missing values.

The reason we got those missing values is the API's limit of 5,000 observations per GET request. If we try to pull five years of hourly time series data, that is more than 40,000 observations, and we cannot pull it in a single request. One way to address this is to set a time range. By bounding the GET request to a specific time window that fits within the 5,000-observation limit, we avoid the problem. We can use the start and end arguments to define this range.
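The arithmetic behind the limit can be sketched in a few lines. The block below is a minimal illustration in Python (the course's other language), not the course code. The 5,000 limit comes from the video, while the five-year date range, the helper names hourly_observations and find_gaps, and the toy series are assumptions made for this example.

```python
from datetime import datetime, timedelta

API_ROW_LIMIT = 5000  # per-request observation limit mentioned in the video
HOUR = timedelta(hours=1)


def hourly_observations(start, end):
    """Count hourly timestamps from start to end, both endpoints included."""
    return (end - start) // HOUR + 1


def find_gaps(timestamps, step=HOUR):
    """Return (previous, next) pairs where neighbors are more than one step apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]


# Five years of hourly data (hypothetical range chosen for illustration).
five_years = hourly_observations(datetime(2019, 1, 1), datetime(2023, 12, 31, 23))
print(five_years)                  # 43824
print(five_years > API_ROW_LIMIT)  # True: too many rows for a single request

# Toy hourly series with hours 3, 4, and 5 missing (made-up data).
toy = [datetime(2024, 1, 1, h) for h in (0, 1, 2, 6, 7)]
gaps = find_gaps(toy)
print(len(gaps))  # 1 gap, between 02:00 and 06:00
```

A check like find_gaps is one way to confirm numerically what the plot shows visually: stretches where consecutive timestamps are more than one hour apart.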
Let's go ahead and do it. In the next example, we are going to use the exact same parameters, but this time we are going to pull data between January 1st, 2024 and February 24th, 2024. So I'm going to set those variables over here and call the function with the same arguments, just adding those two variables. Let's go ahead and run the function, and then reformat the data again, as we did before. We're going to set the index as a time object and arrange the data by the index. You can see that we got, again, a time series in the same format. We can go ahead and check the number of observations with nrow on df2. Now we have 1,297 observations for this period. We can go ahead and plot the data, and as you can see, the series now looks fine. No issues pop up when you eyeball the series.

We'll dive into more detail about monitoring the data output and identifying missing values and other problems in the next chapters. While you can use the eia_get function to pull a large dataset by looping over the timestamps of the series, it could be very tedious to do it manually. This is where the eia_backfill function comes into play, enabling us to pull large datasets beyond the API limitation. In the next video, we'll re-pull the series, this time using the eia_backfill function.
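To make the idea of looping over the time range concrete, here is a minimal Python sketch. It assumes hourly data and inclusive start and end timestamps, and it splits a long range into windows that each stay within the 5,000-observation limit. The helpers split_range and rows_in are hypothetical names used for illustration. This is not how the backfill function is implemented, only the general idea of bounding each request.

```python
from datetime import datetime, timedelta

API_ROW_LIMIT = 5000  # per-request observation limit mentioned in the video
HOUR = timedelta(hours=1)


def split_range(start, end, max_rows=API_ROW_LIMIT, step=HOUR):
    """Split [start, end] into consecutive windows of at most max_rows timestamps."""
    windows = []
    window_start = start
    while window_start <= end:
        # A window holding max_rows timestamps spans (max_rows - 1) steps.
        window_end = min(window_start + (max_rows - 1) * step, end)
        windows.append((window_start, window_end))
        window_start = window_end + step
    return windows


def rows_in(window, step=HOUR):
    """Count timestamps in a window, both endpoints included."""
    first, last = window
    return (last - first) // step + 1


# The range used in the video fits in a single request.
single = split_range(datetime(2024, 1, 1), datetime(2024, 2, 24))
print(len(single), rows_in(single[0]))  # 1 1297

# Five years of hourly data (hypothetical range) needs several requests.
windows = split_range(datetime(2019, 1, 1), datetime(2023, 12, 31, 23))
print(len(windows))  # 9
```

Each window could then serve as the start and end of a separate GET request, with the results appended into one data frame.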
