Generate Sample Data from DDL


This document explains how to automatically generate sample data for any database from its schema, defined as DDL statements.

To do this, I use datafiller, a command-line tool that generates random data from database schemas described in DDL.

v01: Basic datafiller Usage

Here is the first version of my database schema with embedded datafiller directives. This first version contains a single datafiller directive, on this line: CREATE TABLE Pln_Orl ( --df: mult=2.0

cat data/ddl_to_data_schema_01.sql
## CREATE TABLE Order_Line ( 
##   order_line_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
##   purchase_order_id TEXT
##   );
## CREATE TABLE Plan (
##   plan_id TEXT,
##   title TEXT
##   );
## CREATE TABLE Pln_Orl (  --df: mult=2.0
##   plan_id TEXT,
##   order_line_id INT REFERENCES Order_Line (order_line_id) ON UPDATE CASCADE
##   );

df: mult=2.0 tells datafiller to multiply the number of rows generated for the corresponding table by 2.0. When running datafiller, we specify the size of the default sample set. Here I specified it as --size=2, so datafiller will generate 4 rows of Pln_Orl. (The sed filter used throughout prints only the COPY line and the first two rows of each block.)

datafiller --size=2 data/ddl_to_data_schema_01.sql > data/ddl_to_data_data_01.sql
cat data/ddl_to_data_data_01.sql | sed -n "/COPY/,+2 p"
## COPY order_line (order_line_id,purchase_order_id) FROM STDIN (ENCODING 'utf-8');
## 1    purchase
## 2    purchase
## COPY plan (plan_id,title) FROM STDIN (ENCODING 'utf-8');
## plan_id_1_   title_1_1_1_
## plan_id_1_   title_1_1_1_
## COPY pln_orl (plan_id,order_line_id) FROM STDIN (ENCODING 'utf-8');
## plan_id_2_2_2_2_ 1
## plan_id_2_2_2_2_ 2

v02: Use BIGINT Type for FK and PK Fields

This time I used BIGINT type for FK and PK fields instead of TEXT type.

cat data/ddl_to_data_schema_02.sql
## CREATE TABLE order_line (
##   order_line_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  shipment_date TIMESTAMPTZ
##   ,  irsaliye_kg TEXT
##   ,  tesis TEXT
##   ,  yas_toz BOOLEAN
##   ) ;
## CREATE TABLE plan (
##   plan_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   , title TEXT
##   ) ;
## CREATE TABLE pln_orl ( --df: mult=2.0
##   plan_id BIGINT REFERENCES plan (plan_id)
##   ,  order_line_id BIGINT REFERENCES order_line (order_line_id)
##   ) ;
datafiller --size=2 data/ddl_to_data_schema_02.sql > data/ddl_to_data_data_02.sql
cat data/ddl_to_data_data_02.sql | sed -n "/COPY/,+2 p"
## COPY order_line (shipment_date,irsaliye_kg,tesis,yas_toz) FROM STDIN (ENCODING 'utf-8');
## 2018-03-20 10:12:19  irsaliye_kg_2_  tesis_1_    FALSE
## 2018-03-20 10:13:19  irsaliye_kg_2_  tesis_1_    FALSE
## COPY plan (title) FROM STDIN (ENCODING 'utf-8');
## title_2_2_2_2_2_
## title_1_1_1_
## COPY pln_orl () FROM STDIN (ENCODING 'utf-8');

All BIGINT fields are left out of the generated data; note the empty column list in COPY pln_orl. It seems datafiller does not support the BIGINT type properly.

v03: Use INTEGER Type for FK and PK Fields

This time I used INTEGER type for FK and PK fields instead of BIGINT type.

cat data/ddl_to_data_schema_03.sql
## CREATE TABLE order_line (
##   order_line_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  shipment_date TIMESTAMPTZ
##   ,  irsaliye_kg TEXT
##   ,  tesis TEXT
##   ,  yas_toz BOOLEAN
##   ) ;
## CREATE TABLE plan (
##   plan_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   , title TEXT
##   ) ;
## CREATE TABLE pln_orl ( --df: mult=2.0
##   plan_id INT NOT NULL REFERENCES plan (plan_id)
##   ,  order_line_id INT NOT NULL REFERENCES order_line (order_line_id)
##   ) ;
datafiller --size=2 data/ddl_to_data_schema_03.sql > data/ddl_to_data_data_03.sql
cat data/ddl_to_data_data_03.sql | sed -n "/COPY/,+2 p"
## COPY order_line (order_line_id,shipment_date,irsaliye_kg,tesis,yas_toz) FROM STDIN (ENCODING 'utf-8');
## 1    2018-03-20 10:12:19 irsaliye_kg_2_  tesis_2_2_2_    FALSE
## 2    2018-03-20 10:13:19 irsaliye_k  tesis_1_    FALSE
## COPY plan (plan_id,title) FROM STDIN (ENCODING 'utf-8');
## 1    title_2_2_2_2_2_
## 2    title_2_2_2_2_2_
## COPY pln_orl (plan_id,order_line_id) FROM STDIN (ENCODING 'utf-8');
## 2    1
## 2    2

This worked. So datafiller supports INTEGER properly.

v04: Use Predefined Reference Ids as FK Data

In the real world, my database will contain some predefined data, and foreign key fields need to refer to it.

I will use the word directive to draw values from a predefined list:

-- df place_enum: word=./data/ddl_to_data_place_enum.txt
CREATE TABLE place (
  ...
  ,  place_enum INT
  -- df: text=place_enum length=1 
  ...

Note that, according to the definition above, place_enum is not yet a foreign key enforced by the database server, because it has no REFERENCES clause.

Predefined reference ids are here:

cat data/ddl_to_data_place_enum.txt
## 3000
## 3001
## 3002
cat data/ddl_to_data_schema_04.sql
## -- df place_enum: word=./data/ddl_to_data_place_enum01.txt
## CREATE TABLE place (
##   place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  place_enum INT
##   -- df: text=place_enum length=1 
##   ,  address TEXT 
##   ) ;
datafiller --size=5 data/ddl_to_data_schema_04.sql > data/ddl_to_data_data_04.sql
cat data/ddl_to_data_data_04.sql | sed -n "/COPY/,+2 p"
## COPY place (place_id,place_enum,address) FROM STDIN (ENCODING 'utf-8');
## 1    4   address_3_3_3
## 2    2   address_1_

This works. So the word directive can be used to make foreign key fields refer to predefined rows.

v05: Use Predefined Reference Ids together with REFERENCES clause

I want to use the REFERENCES clause of SQL so that the database server ensures the integrity of the foreign key data.

-- df place_enum: word=./data/ddl_to_data_place_enum.txt
CREATE TABLE enum_value (
  enum_value_id INT 
  -- df: text=place_enum length=1 
  ) ;
CREATE TABLE place ( --df: mult=2.0
  ...
  ,  place_enum INT REFERENCES enum_value (enum_value_id)
  -- df: text=place_enum length=1 
  ...
cat data/ddl_to_data_schema_05.sql
## -- df place_enum: word=./data/ddl_to_data_place_enum01.txt
## CREATE TABLE enum_value (
##   enum_value_id INT PRIMARY KEY
##   -- df: text=place_enum length=1 
##   ) ;
## CREATE TABLE place ( --df: mult=2.0
##   place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  place_enum INT  REFERENCES enum_value (enum_value_id)
##   -- df: text=place_enum length=1 
##   ,  address TEXT 
##   ) ;
datafiller --size=5 data/ddl_to_data_schema_05.sql > data/ddl_to_data_data_05.sql
cat data/ddl_to_data_data_05.sql | sed -n "/COPY/,+2 p"
## COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
## 1
## 2
## COPY place (place_id,place_enum,address) FROM STDIN (ENCODING 'utf-8');
## 1    4   address_8_8_8_8_
## 2    2   address_9_9_9

This works too. So we can use the REFERENCES clause to let the database server enforce data integrity between related tables.

v06: Generate Predefined Reference Id File From RDB

RDB is an abbreviation that I use for requirements database. I try to store all software specifications as structured data in a relational database format. For example, I keep the predefined data for my software inside these RDB files. Then I generate their SQL INSERT statements automatically and load them into the database during deployment.

RDB files keep predefined data such as enum values. This kind of data sits at a higher knowledge level than the rest; I will use datafiller to generate randomized data for all other tables.
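
The INSERT generation mentioned above is not shown in this post. As a rough sketch (the file name, column layout, and the awk one-liner are my assumptions, not the author's actual pipeline), predefined rows could be turned into INSERT statements like this:

```shell
# Hypothetical sketch: turn tab-separated RDB rows into SQL INSERT statements.
# The column layout mirrors the enum_value tribble below; the /tmp paths are placeholders.
printf '%s\t%s\t%s\n' \
  1000 'departure depot' 1 \
  1001 'return depot'    1 > /tmp/enum_value.tsv
awk -F'\t' '{
  printf "INSERT INTO enum_value (enum_value_id, enum_value_name, enum_var_id) VALUES (%s, '\''%s'\'', %s);\n", $1, $2, $3
}' /tmp/enum_value.tsv > /tmp/enum_value_insert.sql
cat /tmp/enum_value_insert.sql
```

This prints one INSERT per input row, e.g. `INSERT INTO enum_value (enum_value_id, enum_value_name, enum_var_id) VALUES (1000, 'departure depot', 1);`, which psql can then replay during deployment.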

We want to generate the predefined reference ids from the RDB data of the enum_value entity.

library(magrittr)
evl = tibble::tribble(
  ~enum_value_id,          ~enum_value_name, ~enum_var_id,
            1000,         "departure depot",            1,
            1001,            "return depot",            1,
            1002,               "pick stop",            1,
            1003,               "drop stop",            1,
            1004,          "pick/drop stop",            1,
            1005,  "departure/return depot",            1
  )
data_entities = list("enum_value" = evl)
get_entity_name_to_id = function(data_entities) {
  # For an entity named "x", pick the "x_id" column of its dataframe
  by_name_by_id = function(entity_name, data_entities) {
    id_var = sprintf("%s_id", entity_name)
    data_entities[[entity_name]][[id_var]]
  }
  # Named list: entity name -> vector of its ids
  lapply(names(data_entities), by_name_by_id, data_entities) %>%
    setNames(names(data_entities))
}
m = get_entity_name_to_id(data_entities)
writeLines(as.character(m$enum_value), "./data/ddl_to_data_enum_value_id.txt")
cat data/ddl_to_data_schema_06.sql
## -- df enum_value_id: word=./data/ddl_to_data_enum_value_id.txt
## CREATE TABLE enum_value (
##   enum_value_id INT PRIMARY KEY
##   -- df: text=enum_value_id length=1 
##   ) ;
## CREATE TABLE place ( --df: mult=2.0
##   place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  enum_value_id INT  REFERENCES enum_value (enum_value_id)
##   -- df: text=enum_value_id length=1 
##   ,  address TEXT 
##   ) ;

Note that the word directive now refers to the generated id file: ./data/ddl_to_data_enum_value_id.txt

datafiller --size=5 data/ddl_to_data_schema_06.sql > data/ddl_to_data_data_06.sql
cat data/ddl_to_data_data_06.sql | sed -n "/COPY/,+2 p"
## COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
## 1000
## 1001
## COPY place (place_id,enum_value_id,address) FROM STDIN (ENCODING 'utf-8');
## 1    1005    address_4_
## 2    1004    address_5_5_5_5_

v07: Filter Enum Values by Enum Variables

In my actual software, I define a single table for all nominal (enumerated) values which I call enum_value. All nominal variables which I call enum_var refer to these values. But different nominal variables have different nominal values. For example, place_type and weight_unit are two nominal variables whose values are elements of the sets {"factory", "depot", "storehouse"} and {"kg", "ton", "g"}. I don’t want to define a new table for the value set of each enum_var. Therefore, I collect value sets of all different enum_vars inside one table which is enum_value.

For example, the following dataframe for enum_value contains the values of 3 different enum_var instances.

evl = tibble::tribble(
  ~enum_value_id,          ~enum_value_name, ~enum_var_id,
            1000,         "departure depot",            1,
            1001,            "return depot",            1,
            1002,               "pick stop",            1,
            1003,               "drop stop",            1,
            1004,          "pick/drop stop",            1,
            1005,  "departure/return depot",            1,
            2000,               "pick/load",            2,
            2001,             "drop/unload",            2,
            3000,                   "depot",            3,
            3001,                 "factory",            3,
            3002,                "customer",            3
  )

evr = tibble::tribble(
  ~enum_var_id,     ~enum_var_name,
             1,  "route_stop_enum",
             2,      "action_enum",
             3,       "place_enum"
  )

The place_enum field in the place table should contain only those enum_value ids that belong to the place_enum enum_var.

To enforce this constraint, I use dplyr::filter to generate a separate predefined reference id file for each enum_var before running datafiller.

get_enum_var_name_to_id = function(evl, evr) {
  get_enum_value_id_by = function(evn, evl_evr) {
    evl_evr %>%
      dplyr::filter(enum_var_name == evn) %>%
      magrittr::extract2("enum_value_id")
  }
  evl_evr = evl %>%
    dplyr::left_join(evr, by = "enum_var_id") 
  lapply(unique(evl_evr$enum_var_name), get_enum_value_id_by, evl_evr) %>%
    setNames(unique(evl_evr$enum_var_name))
}
m = get_enum_var_name_to_id(evl, evr)
m
## $route_stop_enum
## [1] 1000 1001 1002 1003 1004 1005
## 
## $action_enum
## [1] 2000 2001
## 
## $place_enum
## [1] 3000 3001 3002
for (evn in names(m)) {
  writeLines(as.character(m[[evn]]), sprintf("./data/ddl_to_data_%s.txt" , evn))
}

writeLines writes the id values of the enum variables into the following files:

ls data/*_enum.txt
## data/action_enum.txt
## data/ddl_to_data_action_enum.txt
## data/ddl_to_data_place_enum.txt
## data/ddl_to_data_route_stop_enum.txt
## data/place_enum.txt
## data/route_stop_enum.txt
cat data/ddl_to_data_schema_07.sql
## -- df place_enum: word=./data/ddl_to_data_place_enum.txt
## -- df action_enum: word=./data/ddl_to_data_action_enum.txt
## CREATE TABLE enum_value (
##   enum_value_id INT PRIMARY KEY
##   -- df: text=place_enum length=1 
##   ) ;
## CREATE TABLE place ( --df: mult=2.0
##   place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##   ,  place_enum INT REFERENCES enum_value (enum_value_id)
##   -- df: text=place_enum length=1 
##   ,  address TEXT 
##   ,  action_enum INT REFERENCES enum_value (enum_value_id)
##   -- df: text=action_enum length=1 
##   ) ;

Note that this time there are two word directives referring to two separate reference id files: ./data/ddl_to_data_place_enum.txt and ./data/ddl_to_data_action_enum.txt

datafiller --size=5 data/ddl_to_data_schema_07.sql > data/ddl_to_data_data_07.sql
cat data/ddl_to_data_data_07.sql | sed -n "/COPY/,+2 p"
## COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
## 3000
## 3001
## COPY place (place_id,place_enum,address,action_enum) FROM STDIN (ENCODING 'utf-8');
## 1    3000    address_7_  2001
## 2    3000    address_1_  2000

Note that place_enum and action_enum values are sampled from their corresponding reference id files.

v08: Ordering of Tables: First Independent Tables Then Dependent Ones

datafiller raises a KeyError when referenced tables appear in the wrong order.

For example, the following schema definition fails because enum_var is referenced before it is defined:

CREATE TABLE enum_value (
  enum_value_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
  ,  enum_var_id INT NOT NULL REFERENCES enum_var (enum_var_id)
  ,  enum_value_name TEXT 
  ) ;
CREATE TABLE enum_var (
  enum_var_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
  ,  enum_var_name TEXT 
  ) ;
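
The fix is to define the independent table first. A minimal sketch of the reordered schema, written to a temporary file here just to make the ordering checkable (the /tmp path is a placeholder, not the post's actual file layout):

```shell
# Corrected ordering: enum_var is defined before enum_value references it,
# so datafiller can resolve the REFERENCES clause when it reaches enum_value.
cat > /tmp/ddl_to_data_schema_ordered.sql <<'EOF'
CREATE TABLE enum_var (
  enum_var_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
  ,  enum_var_name TEXT
  ) ;
CREATE TABLE enum_value (
  enum_value_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
  ,  enum_var_id INT NOT NULL REFERENCES enum_var (enum_var_id)
  ,  enum_value_name TEXT
  ) ;
EOF
grep -c "CREATE TABLE" /tmp/ddl_to_data_schema_ordered.sql
```

With this ordering, the same datafiller invocation used in the earlier versions runs without a KeyError.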

Step 2: Importing Data into PostgreSQL Database

Now we are going to start the PostgreSQL database server and import the generated data. As the database server, I use a Docker container published by https://github.com/subzerocloud/postgrest-starter-kit

docker start postgreststarterkit_db_1

I put user credentials for this database as environment variables into .zshrc:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=superuser
export PGDATABASE=app
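
These are the standard libpq environment variables, so once they are exported, psql can connect without any -h/-p/-U/-d flags (the explicit flags on the psql call below are then redundant). A quick sanity check of the values:

```shell
# Standard libpq connection variables; with these set, a bare
# `psql -f data/ddl_to_data_init_01.sql` would connect to the same database.
export PGHOST=localhost PGPORT=5432 PGUSER=superuser PGDATABASE=app
echo "$PGUSER@$PGHOST:$PGPORT/$PGDATABASE"   # prints superuser@localhost:5432/app
```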

Now I need to run the files that contain the DDL and COPY commands using the psql tool. I could run those files separately one by one, but instead I made a new SQL file that first re-initializes the database schema by dropping it and then includes the DDL and COPY files.

psql -d app -h localhost -p 5432 -U superuser -f data/ddl_to_data_init_01.sql
cat data/ddl_to_data_init_01.sql
## drop schema if exists ddl_to_data cascade;
## create schema ddl_to_data;
## set search_path = ddl_to_data, public;
## 
## \ir schema.sql
## \ir data.sql

v09: Exclude some tables

Data for predefined tables comes from external sources. To exclude these tables from datafiller's generation, use the nogen directive:

CREATE TABLE enum_var ( -- df: nogen

Tech · 19 Jan, 2018

Any work (images, writings, presentations, ideas or whatever) which I own is always provided under the Creative Commons Attribution-Share Alike 3.0 License.

Mert Nuhoglu is a Trabzon-born programmer and data scientist.
