Managing Moving Migrations: I need lots of test data and I need it fast

Sebastian Rogers
3 min read · Jun 20, 2023

TL;DR You can’t use real data for testing, and there aren’t any good sets of test data available on the internet, so you’ll need to create your own. Here’s a handy tool to help.

We have a client performing a small (~1 TB) data migration, but one that’s time sensitive: it has to complete between Friday end of play and Monday start of play.

Not a problem, except that we have no access to the data until much closer to the transfer day. So we need to do some timings to check how long it will take and whether we can move it in a single, long migration or will have to do multiple ‘deltas’.

We know that the documents to be moved are a collection of the usual business suspects: Microsoft Office, PDF and PNG. We know that they total about 1 TB, so all we need is a similar set of documents and then we can do a ‘dummy’ run.

Just two problems:

  1. We can’t use their live systems: these are subject to data retention policies, and they have a sensible ‘no test data in production’ policy as well.
  2. We can use their test systems, but these are subject to a sensible ‘no live data in test’ policy.

So we need a terabyte of test documents.

No problem, I thought, someone will have generated such a thing already. There is Use sample data packs with your Microsoft 365 Developer Program subscription | Microsoft Learn, which sounds like it should give you some documents, but it doesn’t.

Okay so we’re going to have to do it ourselves.

The problem is that, according to Metadata Consulting [dot] ca: What is the average size of office and pdf documents?, the average is 321 KB per document, so 1 TB would be 3,115,264 documents. And that report is old; modern document formats tend to be much smaller, and in fact, checking a live system, the average document size these days is under 10 KB. There is no way we are generating all of these by hand.
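If you want to check the arithmetic yourself (using decimal units, as in that report), a couple of PowerShell one-liners will do it:

```powershell
# Rough document counts needed to reach 1 TB of test data (decimal units)
$targetBytes = 1e12                       # 1 TB
[math]::Floor($targetBytes / 321e3)       # ~3,115,264 documents at 321 KB each
[math]::Floor($targetBytes / 10e3)        # ~100,000,000 documents at 10 KB each
```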

If we had a good lorem ipsum generator then we could generate sample documents with random text; I’ll deal with how to do that next week. However, even that won’t help much, because such generators aren’t fast enough to produce terabytes of data, certainly not the open source ones.

So let’s say we have a reasonable set of test documents, maybe a couple of hundred. What we need is something that will use these to build a random file structure, populated by copying those documents randomly into it. We’ll need to be able to say how deep that structure can go and how much data we need, but other than that it just needs to be fast. We also need something that’s easy to run with a minimum of overhead.

Sounds like a job for PowerShell, or bash.

PowerShell isn’t fast, but it’s available for Linux and Windows, and it means all the source code can be delivered from Git so you can see exactly what it’s doing. If I’d written a C program it would have to be compiled, and you never know what ‘extras’ I might have included.
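To give a flavour of the approach, here’s a minimal sketch, not the actual script from the repository below; the parameter names and defaults are mine, purely for illustration. It copies randomly chosen sample documents into a randomly nested folder structure until the target amount of data has been written:

```powershell
# Minimal sketch: copy random sample documents into a random folder tree
# until a target amount of data has been written. Names are illustrative.
param(
    [string]$SourcePath  = '.\samples',    # a few hundred representative documents
    [string]$TargetPath  = '.\testdata',   # where the generated structure goes
    [int]   $MaxDepth    = 5,              # maximum folder nesting depth
    [long]  $TargetBytes = 1TB             # how much test data to generate
)

$samples = Get-ChildItem -Path $SourcePath -File
$written = [long]0
$counter = 0

while ($written -lt $TargetBytes) {
    # Build a folder path of random depth under the target
    $depth  = Get-Random -Minimum 1 -Maximum ($MaxDepth + 1)
    $folder = $TargetPath
    for ($i = 0; $i -lt $depth; $i++) {
        $folder = Join-Path $folder ('folder{0:D3}' -f (Get-Random -Maximum 100))
    }
    New-Item -Path $folder -ItemType Directory -Force | Out-Null

    # Copy a randomly chosen sample document under a unique name
    $doc  = $samples | Get-Random
    $dest = Join-Path $folder ('doc{0:D7}{1}' -f $counter, $doc.Extension)
    Copy-Item -Path $doc.FullName -Destination $dest

    $written += $doc.Length
    $counter++
}
```

Point it at a folder of sample documents and an empty target folder; drop $TargetBytes to something like 1GB for a quick smoke test before committing to a full terabyte.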

Rather than go through the details, I’ve put it all in a git repository for you here: Simple-Innovation/simple-test-data: Provide sets of test data (github.com)

It has a full README that will tell you how to use it.

TL;DR You can’t use real data for testing, and there aren’t any good sets of test data available on the internet, so you’ll need to create your own. Here’s a handy tool to help.


Sebastian Rogers

Technical Director for Simple Innovations Ltd. First paid for code in 1980, but still has all his own hair.