Until then, whenever I delivered SQL or PL/SQL training, the “universal” EMP and DEPT tables were enough for me to demonstrate practically everything.
The problem came when I started needing examples with more volume, where I had to handle real cardinalities, estimates, and execution plans with significant cost differences.
That was when I decided to create a model from scratch using synthetic data, and to make it open to anyone who wanted to use it.
I didn’t want to build the typical e-commerce dataset. I wanted something that could evolve towards a future Data Warehouse and that would also let me show examples of analytical queries.
And for that, I needed to load the model with a large number of rows.
In the end, I recovered a dataset that I had created years earlier for some tech conferences in Pamplona.
In that talk, I walked through the entire life cycle of an application, from its very origins to the long-term exploitation of its historical data, with an “all-in-one” demo of:
- the creation of the analysis and physical model (Oracle Designer 2000),
- an application with Oracle Forms and reports with Oracle Reports,
- the ETLs that would feed a Data Warehouse with Oracle Warehouse Builder,
- the later multidimensional exploitation with OLAP cubes (Oracle Discoverer),
- and even the prediction of patterns with Oracle DataMiner.
I called that model FLIGHTS.
Hey, not bad at all!
So I decided to use it in the book to explain execution plans, statistics, cardinality, selectivity, and optimizer behavior with some depth.
For years, it was only available as a plain 9i export, from back when Data Pump did not even exist.
I have published it in full
At last, I decided to publish it on GitHub, with all the code needed to generate it from scratch:
👉 https://github.com/CafeDatabase/flights-dataset
I reviewed the scripts, corrected some terrible typos (no more “Washintong” or “Lion”), standardized fictional airline names, validated real IATA codes, and added a small final validator to ensure the generation meets minimum requirements.
Why that validator?
Because the model has a deliberately random component, designed to generate some degree of skew and, in the Data Mining part, to let hidden patterns emerge as a result of chance.
In the load I used for that conference, and that I also used in the book, something quite funny happened:
MALE customers with marital status MARRIED seemed to prefer traveling to ROME.
And at the conference, obviously, I would deliver the prediction: “Married men go to Rome.”
Look, it doesn’t always happen. It’s random. And that’s also part of the learning. It’s possible that when you generate this model, you’ll get a different coincidence… or none.
What does this dataset consist of?
It is a synthetic, performance-oriented model, with:
- 57,711 flights
- 171,113 bookings
- 2,885,550 seats
- 34 active airports
- 102 agencies with bookings
- Seats/flights ratio: 50
- Controlled distributions and a certain degree of skew
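The validator mentioned above boils down to sanity checks on figures like these. As a sketch (the table names come from the dataset itself, but the real validator’s thresholds and column lists may differ):

```sql
-- Row-count sanity checks. The expected figures are the ones quoted
-- in this post; the actual validator's thresholds may differ.
SELECT 'FLIGHTS'  AS table_name, COUNT(*) AS num_rows FROM flights
UNION ALL
SELECT 'BOOKINGS', COUNT(*) FROM bookings
UNION ALL
SELECT 'SEATS', COUNT(*) FROM seats;

-- The seats/flights ratio should come out around 50.
SELECT ROUND((SELECT COUNT(*) FROM seats) /
             (SELECT COUNT(*) FROM flights)) AS seats_per_flight
FROM   dual;
```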
In reality, it is not meant to be “realistic” from a functional standpoint. It is meant to be useful didactically.
And that is very different.
You shouldn’t see it as the model real airlines actually follow. They probably don’t mix customer information with agencies or share sales data between companies. But for teaching entity-relationship models, and for working with a somewhat larger database schema, I think it is ideal. At least, thanks to it, I was able to generate examples that showed the impact of Cartesian products, unnecessary sorts, significantly different costs between access plans, and so on.
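For instance, the Cartesian product case is trivial to provoke at this volume. A minimal sketch (the table names are the dataset’s; everything else is standard Oracle tooling):

```sql
-- No join condition between BOOKINGS and FLIGHTS: the optimizer is
-- forced into a Cartesian product of roughly 171,113 x 57,711 rows.
EXPLAIN PLAN FOR
SELECT *
FROM   bookings b, flights f;

-- The resulting plan shows a MERGE JOIN CARTESIAN step with a
-- dramatically higher cost than the properly joined query.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```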
In the end, the simpler the model, the more powerful the learning.
Why make it public now?
For two reasons:
- because I am working on the second edition of the book “SQL Optimization in Oracle”
- because I am going to lead a Master’s in SQL Optimization in Oracle at the Café Database Academy.
The thing is that, in everything I do, I always try to ensure that part of my work is an open contribution to the community. I already did it with the first edition of the book, and the dataset had thousands of downloads.
Also, something happened to me: I tried to find an open dataset with synthetic data, simple, purely for educational purposes… and I couldn’t find one.
I found datasets with complex models, and what I wanted was to be able to follow these principles:
- That the example is simple.
- That it involves the minimum number of tables/objects possible.
- That the result is visually identifiable.
I didn’t want to use a medical dataset and force readers to untangle the relational model in every example. Nor did I want a dataset so small that sorting 10 rows, hashing 10 rows, or doing nested loops over 10 rows all produced the same cost.
So this dataset has only simple tables: CUSTOMERS, BOOKINGS, FLIGHTS, SEATS, CITIES, COUNTRIES, AGENCIES…
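With so few tables, a didactic query stays short. A sketch of the kind of three-table join I mean (the column names here are hypothetical; check the repository’s DDL scripts for the real ones):

```sql
-- Hypothetical column names; adjust to the DDL in the repository.
SELECT c.name,
       f.departure_airport,
       f.arrival_airport
FROM   customers c
       JOIN bookings b ON b.customer_id = c.customer_id
       JOIN flights  f ON f.flight_id   = b.flight_id
WHERE  ROWNUM <= 10;
```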
The simpler the example, the more powerful the learning.
And it surprised me that, with minimal variations, I could cover almost all the examples in the book: wrong cardinality estimates by the optimizer, adaptive execution plans, non-uniform selectivities…
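Comparing estimated against actual cardinalities on skewed data is the bread and butter of those examples. A sketch with standard Oracle tooling (the column name and the IATA code are illustrative assumptions):

```sql
-- Run the query collecting row-source statistics.
SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM   flights f
WHERE  f.departure_airport = 'PNA';  -- hypothetical column and value

-- Compare E-Rows (estimated) vs A-Rows (actual) in the last plan;
-- on skewed columns without histograms they can diverge wildly.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'ALLSTATS LAST'));
```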
And I ended up quite satisfied, to be honest.
What can you do with it?
- Practice SQL and PL/SQL.
- Load it into an Autonomous Database Free Tier instance if you don’t want to set up a server.
- Use it in classes.
- Build a prototype in APEX.
- Create a demo application for your portfolio.
- …whatever you feel like!
If you are a teacher, you have varied tables with enough data to compute totals at considerable volume. And if the full relational model feels too big, you can look at it more atomically: customers and their details, only the flights from one airport to another (without touching bookings), or the income each agency earns from flight sales… all of it accessing very few tables and building simple SQL sets that are easy to explain.
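The income-per-agency case, for example, needs only two tables. A sketch (the column names are hypothetical; check the repository’s DDL for the real ones):

```sql
-- Hypothetical column names (name, amount, agency_id);
-- adjust to the DDL in the repository.
SELECT a.name,
       SUM(b.amount) AS total_income
FROM   agencies a
       JOIN bookings b ON b.agency_id = a.agency_id
GROUP  BY a.name
ORDER  BY total_income DESC;
```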
Or if you are learning APEX or programming in any environment that needs a dataset in a database, this gives you enough for a good prototype!
I’d be delighted if you included a demo application built on this dataset in your portfolio!
For me, it’s not just a repository. It’s part of my method.
If you use it and find possible improvements, or if you want to propose interesting scenarios to test, I’ll be happy to hear from you.
This year, I’m going to be very focused on SQL optimization. The book, the Master… and this dataset are three pieces of the same mechanism.
And the dataset is completely open.
Here you go.