Dealing with data sets that are too big

If you’ve found this page you’ve likely got stuck trying to load a file into an Excel spreadsheet and been hit with the “This data set is too large for the Excel grid. If you save this workbook, you’ll lose data that wasn’t loaded” message.
What does that mean?

Excel is telling you that you have too much data to fit into a regular spreadsheet. If you carry on trying to use it – or accidentally save over it – you’ll be deleting data.
In this guide you’ll learn:
- How to open huge files and get what you need from the data
- How to quickly summarise and group data – like you would in a pivot table
- How to join long lists of values to your data using a join – like you would with a vlookup
How much is too much data? The hard limits of MS Excel and Google Sheets
Limits vary depending on the software. Google Sheets tops out at 18,278 columns wide (column ZZZ) or 10 million cells in total, whichever you hit first. A sheet 20 columns wide, for example, reaches that cell ceiling at 500,000 rows. Converting an Excel file or CSV to a Google Sheet will automatically cut off anything beyond these limits.
To trigger the ‘too big for the grid’ warning in Excel you’ll have a file that is more than 1,048,576 rows deep and/or more than 16,384 columns wide. Older .xls files from the 97-2003 era have a much lower limit of 65,536 rows by 256 columns.
It’s also worth noting that there are limits within limits. Individual cells are capped at 32,767 characters (with shorter lengths allowed for column headers), and there’s even a limit of roughly 65.5k hyperlinks per worksheet.
Some limits are memory-based, like the number of tabs you can have, although just because your machine can open a 300-tab document doesn’t mean you should.
Using R for quick analysis
The spreadsheet alternative we’ll be covering here is some quick data manipulation using R. This is a relatively straightforward language for maths and charting that any analyst can learn to use.
R makes statistical computing simple to do on normal machines without special hardware. Reading, summarising and charting huge amounts of data should be a breeze for anyone.
If you’ve never come across R before it’ll only take a few minutes to install the base language and then install RStudio to make it easier to see what you’re working with.
The challenge:
In this example scenario we’re going to try to answer a straightforward sales question: looking at nearly 2.3 million sales records for an eCommerce site, which area of the UK has spent the most money?
I’ll be using fictional sales data exported from a CMS to illustrate how we can join, summarise and export millions of rows with relatively few lines of code.
In this scenario the sales data and the customers’ personal details are kept in different places for security reasons. We’ll need to find a way to attach postcode areas to our sales data.
Our first step is getting the data into R so we can look at it.
How to read CSVs in R:
Reading in large text files can be done in R in one simple command, using the logically named “read.csv” function.
By default you can just point it at your file location and it’ll get the basics right automatically. There are more advanced options you can use to specify whether you want to include a header row, what sort of separator you’re using (commas, tabs etc.), how to handle quotes and other more complex things.
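As a rough sketch, here’s what a fuller call could look like with those options spelled out. This isn’t part of the example that follows; “my_data.csv” is just a placeholder file name, and the arguments shown are the standard read.csv options.
my_data <- read.csv("my_data.csv",
                    header = TRUE,             # the first row contains column names
                    sep = ",",                 # use "\t" for tab-separated files
                    quote = "\"",              # character used to wrap text fields
                    stringsAsFactors = FALSE)  # keep text as plain character columns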
For this exercise we have two data sets to read in: a long list of sales figures in a CSV called raw_sales_data.csv, and a list of customers and their postcode areas called customer_ids.csv. In this example both files are in a user’s Downloads folder, so we specify this as the filepath. We also want to keep each data set and name it something sensible, which we can do using “<-” to assign the names. The code looks like this:
sales_data <- read.csv("C:/Users/Admin/Downloads/raw_sales_data.csv")
region_data <- read.csv("C:/Users/Admin/Downloads/customer_ids.csv")
We now have two data frames named sales_data and region_data. We can also see that sales_data has over 2.3 million observations (or rows) which would have been impossible to load in Excel.
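If you’d rather sense check this from the console than the Environment pane, a few base R commands do the same job. This is just a quick sketch using the data frames created above:
nrow(sales_data)   # number of observations (rows)
str(sales_data)    # column names and types at a glance
head(sales_data)   # the first six rows, as a quick console preview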

It’s also possible to preview the data by clicking on a data frame, which is useful for sense checking that you have the right data. Here’s what a preview of the sales data looks like – sales by date broken down to a per-product basis:

To get a set of top level statistics by postal region we’re going to have to aggregate these sales figures. Before we can do that though, the next step is to join our regional data to our sales data.
Our regional data is currently just a simple list of customers and their postcode areas, which looks like this:

How to do a left join:
If this was a spreadsheet a likely step you’d make would be to do a VLOOKUP (or XLOOKUP if you’re feeling fancy), using the customer_id as the lookup value.
We’ll essentially be doing the same thing, but using a function called left_join, a kind of join that’s common across databases and computing languages (such as SQL).
A left join returns all records from the left table and brings in any matching records from the right table, based on a common value. Thinking of tables as left or right feels a little strange at first, but it’s the accepted terminology: the assumption is that the first table sits on the left when reading from left to right.
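To make that concrete, here’s a toy sketch with made-up values. The table names and figures are purely illustrative, and left_join and inner_join come from the dplyr package, which we load properly in a moment.
library(dplyr)
left_table  <- data.frame(customer_id = c(1, 2, 3))
right_table <- data.frame(customer_id = c(2, 3, 4),
                          postcode_area = c("Leeds", "York", "Hull"))
left_join(left_table, right_table, by = "customer_id")   # keeps customers 1-3; customer 1 gets NA
inner_join(left_table, right_table, by = "customer_id")  # keeps only the matches, customers 2 and 3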
The diagram below shows six kinds of join, demonstrating the different ways in which it’s possible to match up data tables. (Wikipedia has a detailed list of joins and how they work)

Joining In R
To do this with datasets in R, the function goes left_join(first table, second table, by = variables in common). In our example we want to join the region and sales data together using the customer ID, and save this as a new data frame called something sensible like “joined_data”. This translates to the following code:
joined_data <- left_join(region_data, sales_data, by = c("customer_id" = "customer_id"))
If you’ve just tried that and got a “could not find function” error, that’s because the left_join function is part of a data manipulation package called dplyr that needs to be loaded. Packages are useful additions to the base R language, and dplyr is an extra set of tools that makes it easy to manipulate data.
If this is your first time using R and you don’t yet have dplyr, loading it into your current session takes two commands. One to install, and another to load the library:
install.packages("dplyr")
library(dplyr)
We now have a third data frame called “joined_data”, that has the same number of observations (rows) as our original sales_data but with one extra variable (column).
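A couple of quick console checks will confirm this; again, just a sketch using the data frames we’ve already made:
dim(joined_data)    # number of rows and columns after the join
names(joined_data)  # postcode_area should now appear in the column list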

Previewing the data, we can see that the postcode_area has now been added into our sales data, which means we can now get some insights on sales by location.

How to group and summarise in R
If this were Excel, your likely next step would be to throw all this data into a pivot table and summarise how much revenue there was in each postcode_area. We’re going to do the same thing here, but in a similar way to how it’s done with an SQL database.
The code looks like this:
summary_data <- joined_data %>%
  group_by(postcode_area) %>%
  summarise(total_orders = n_distinct(order_number), total_revenue = sum(order_value)) %>%
  arrange(desc(total_revenue))
Here we’re using an operator called a pipe (%>%) to carry out a series of transformations all in one go, and create our final data frame called summary_data.
This is a breakdown of what we just asked R to do:
- Take our joined_data table
- Group all the data by postcode_area (e.g. group all the rows that have “Leeds” together)
- Create a new column called total_orders which counts the number of distinct (unique) order numbers, to tell us how many orders each place had
- Create another column called total_revenue which sums up all the figures in the order_value column
- Arrange this data by areas that have the most total_revenue, highest to lowest
- Put the results of all of this into a new data frame and call it summary_data
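If the pipe still feels abstract, the same pipeline can be written without it by saving each intermediate step. The sketch below produces the same summary_data; the intermediate names are just illustrative.
grouped_data    <- group_by(joined_data, postcode_area)
summarised_data <- summarise(grouped_data,
                             total_orders = n_distinct(order_number),
                             total_revenue = sum(order_value))
summary_data    <- arrange(summarised_data, desc(total_revenue))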
What we now have is a new data frame of 125 different areas that this fictional company shipped to, and it looks like this:

In this scenario, London regions have driven the most revenue. Belfast, Brighton and Glasgow are three notable areas outside of the English capital.
Success! We’ve distilled 2.3 million sales records into 125 postcode areas. The final step is to export the data back into a format we can use elsewhere.
Exporting data frames as a CSV in R
Exporting data to a CSV is done using almost the same command as reading it in, but instead of “read.csv” we use “write.csv”. This is as simple as stating which data you want to write, and what you want the file to be called. For this example the code looks like this:
write.csv(summary_data, "sales_by_postcode_area.csv", row.names = FALSE)
The additional ‘row.names’ argument here just specifies that I’d rather not have a row number included in the export. Your export will go to wherever your current working directory is, unless you specify a full file path.
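If you’re not sure where that is, you can check or change the working directory from the console, or hand write.csv a full path instead. The paths below are just examples:
getwd()                            # shows the current working directory
setwd("C:/Users/Admin/Documents")  # optionally change it (example path)
write.csv(summary_data,
          "C:/Users/Admin/Downloads/sales_by_postcode_area.csv",  # or give a full path directly
          row.names = FALSE)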
Summary
By this point you’ll hopefully be able to carry out a similar exercise with data sets of your own. Why not practise joining and summarising some large data sets? R is a good place to start learning how to analyse larger volumes of data without needing a data warehouse.
All of the code used in this example is here:
library(dplyr)
sales_data <- read.csv("raw_sales_data.csv")
region_data <- read.csv("customer_ids.csv")
# Join both sets by customer_id
joined_data <- left_join(region_data, sales_data, by = c("customer_id"="customer_id"))
# Summarise by region
summary_data <- joined_data %>%
  group_by(postcode_area) %>%
  summarise(total_orders = n_distinct(order_number), total_revenue = sum(order_value)) %>%
  arrange(desc(total_revenue))
# Write data to file
write.csv(summary_data, "sales_by_postcode_area.csv", row.names = FALSE)
If you’re curious to know more about what you can do with R, DataCamp has a variety of free R courses available.
Final Thought
For some, going through an entire career and never straying from Excel or Google Sheets will be absolutely fine. We’re not suggesting your accounting team needs a BigQuery subscription by any means.
That said, don’t expect that ageing it-deletes-everything-I-swear-it-hates-me laptop to go far if you need to scale up to handle something big.
If you’re someone with ‘analyst’ in your title there will be occasions where you’ll be working with large data sets. In an era where ‘data warehouse’ is no longer a dirty word, every analyst should be able to read and manipulate millions of rows of data to find the secrets hidden within.
A Real World Example
There are costly – sometimes dangerous – consequences for people who don’t know how to handle large data sets.
Some may remember when Public Health England lost 16,000 COVID-19 test results in October 2020, a mistake that led to 50,000 potentially infectious people going uncontacted while hundreds were dying of COVID every day.
The track and trace project was rapidly developed (at great cost) and it’s generally understood that much of the early data work was handled manually. It was later reported that the cause of the error was an Excel accident: a CSV of test results was loaded into Excel for editing, then submitted to a main database.
It’s speculated that this is the point at which results were missed off, since anything over the million-row mark simply wouldn’t make it into an Excel sheet, and the cut-off drops to just 65,536 rows if the file PHE was working with happened to be in the older .xls format.
While British service data going missing is a meme at this point, this incident highlights one of the biggest flaws of attempting to work with important (not just big) data outside of a controlled environment.
‘Lineage’ becomes very important in longer or more complex analysis. Work needs to be peer-reviewed. The steps taken to reach a conclusion need to be repeatable. Anyone should be able to question the logic of an analysis and update it if necessary.
In short, when it comes to analysis that matters, the make-or-break, life-and-death kind, a spreadsheet called summary_analysis-final-FINAL-01.xlsx is a recipe for disaster.