Generate Sample Data from DDL
This document explains how to automatically generate sample data for a database from its schema, defined as DDL statements.
To do this, I use datafiller, a command-line tool that generates random data from database schemas described in DDL.
v01: Basic datafiller Usage
Here is the first version of my database schema with embedded datafiller directives. So far there is only one directive, on this line: CREATE TABLE Pln_Orl ( --df: mult=2.0
cat data/ddl_to_data_schema_01.sql
##> CREATE TABLE Order_Line (
##> order_line_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
##> purchase_order_id TEXT
##> );
##> CREATE TABLE Plan (
##> plan_id TEXT,
##> title TEXT
##> );
##> CREATE TABLE Pln_Orl ( --df: mult=2.0
##> plan_id TEXT,
##> order_line_id INT REFERENCES Order_Line (order_line_id) ON UPDATE CASCADE
##> );
The directive df: mult=2.0 tells datafiller to multiply the number of samples generated for the corresponding entity by 2.0. When running datafiller, we specify the size of the default sample set; here I specified --size=2, so datafiller will generate 4 rows of Pln_Orl. (The sed filter below prints only the COPY line and the first two rows of each table.)
datafiller --size=2 data/ddl_to_data_schema_01.sql > data/ddl_to_data_data_01.sql
cat data/ddl_to_data_data_01.sql | sed -n "/COPY/,+2 p"
##> COPY order_line (order_line_id,purchase_order_id) FROM STDIN (ENCODING 'utf-8');
##> 1 purchase
##> 2 purchase
##> COPY plan (plan_id,title) FROM STDIN (ENCODING 'utf-8');
##> plan_id_1_ title_1_1_1_
##> plan_id_2_2_2_2_ title_1_1_1_
##> COPY pln_orl (plan_id,order_line_id) FROM STDIN (ENCODING 'utf-8');
##> plan_id_2_2_2_2_ 2
##> plan_id_2_2_2_2_ 2
v02: Use BIGINT Type for FK and PK Fields
This time I used the BIGINT type for the FK and PK fields instead of TEXT.
cat data/ddl_to_data_schema_02.sql
##> CREATE TABLE order_line (
##> order_line_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , shipment_date TIMESTAMPTZ
##> , irsaliye_kg TEXT
##> , tesis TEXT
##> , yas_toz BOOLEAN
##> ) ;
##> CREATE TABLE plan (
##> plan_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , title TEXT
##> ) ;
##> CREATE TABLE pln_orl ( --df: mult=2.0
##> plan_id BIGINT REFERENCES plan (plan_id)
##> , order_line_id BIGINT REFERENCES order_line (order_line_id)
##> ) ;
##>
datafiller --size=2 data/ddl_to_data_schema_02.sql > data/ddl_to_data_data_02.sql
cat data/ddl_to_data_data_02.sql | sed -n "/COPY/,+2 p"
##> COPY order_line (shipment_date,irsaliye_kg,tesis,yas_toz) FROM STDIN (ENCODING 'utf-8');
##> 2018-12-11 19:30:11 irsaliye_k tesis_1_ TRUE
##> 2018-12-11 19:30:11 irsaliye_k tesis_2_2_2_ TRUE
##> COPY plan (title) FROM STDIN (ENCODING 'utf-8');
##> title_1_1_1_
##> title_2_2_2_2_2_
##> COPY pln_orl () FROM STDIN (ENCODING 'utf-8');
##>
##>
All BIGINT fields are left empty. It seems like datafiller does not support BIGINT properly.
v03: Use INTEGER Type for FK and PK Fields
This time I used the INTEGER type for the FK and PK fields instead of BIGINT.
cat data/ddl_to_data_schema_03.sql
##> CREATE TABLE order_line (
##> order_line_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , shipment_date TIMESTAMPTZ
##> , irsaliye_kg TEXT
##> , tesis TEXT
##> , yas_toz BOOLEAN
##> ) ;
##> CREATE TABLE plan (
##> plan_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , title TEXT
##> ) ;
##> CREATE TABLE pln_orl ( --df: mult=2.0
##> plan_id INT NOT NULL REFERENCES plan (plan_id)
##> , order_line_id INT NOT NULL REFERENCES order_line (order_line_id)
##> ) ;
##>
##>
datafiller --size=2 data/ddl_to_data_schema_03.sql > data/ddl_to_data_data_03.sql
cat data/ddl_to_data_data_03.sql | sed -n "/COPY/,+2 p"
##> COPY order_line (order_line_id,shipment_date,irsaliye_kg,tesis,yas_toz) FROM STDIN (ENCODING 'utf-8');
##> 1 2018-12-11 19:30:12 irsaliye_kg_2_ tesis_2_2_2_ TRUE
##> 2 2018-12-11 19:31:12 irsaliye_kg_2_ tesis_1_ FALSE
##> COPY plan (plan_id,title) FROM STDIN (ENCODING 'utf-8');
##> 1 title_1_1_1_
##> 2 title_1_1_1_
##> COPY pln_orl (plan_id,order_line_id) FROM STDIN (ENCODING 'utf-8');
##> 1 2
##> 1 2
This worked. So datafiller supports INTEGER properly.
v04: Use Predefined Reference Ids as FK Data
In the real world, I will have some predefined data in my database, and I need to refer to it from the foreign key fields that reference it. I will use the usual word directive to reuse a predefined list of values:
-- df place_enum: word=./data/ddl_to_data_place_enum.txt
CREATE TABLE place (
...
, place_enum INT
-- df: text=place_enum length=1
...
Note that, according to the definition above, place_enum is not a FK managed by the database server, because it does not have a REFERENCES clause yet.
Predefined reference ids are here:
cat data/ddl_to_data_place_enum.txt
##> 3000
##> 3001
##> 3002
cat data/ddl_to_data_schema_04.sql
##> -- df place_enum: word=./data/ddl_to_data_place_enum01.txt
##> CREATE TABLE place (
##> place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , place_enum INT
##> -- df: text=place_enum length=1
##> , address TEXT
##> ) ;
##>
datafiller --size=5 data/ddl_to_data_schema_04.sql > data/ddl_to_data_data_04.sql
cat data/ddl_to_data_data_04.sql | sed -n "/COPY/,+2 p"
##> COPY place (place_id,place_enum,address) FROM STDIN (ENCODING 'utf-8');
##> 1 5 address_1_
##> 2 3 address_1_
This works. So, the word directive can be used to make foreign key fields refer to predefined rows.
v05: Use Predefined Reference Ids together with REFERENCES clause
I want to use the SQL REFERENCES clause to let the database server ensure the integrity of the foreign key data.
-- df place_enum: word=./data/ddl_to_data_place_enum.txt
CREATE TABLE enum_value (
enum_value_id INT
-- df: text=place_enum length=1
) ;
CREATE TABLE place ( --df: mult=2.0
...
, place_enum INT REFERENCES enum_value (enum_value_id)
-- df: text=place_enum length=1
...
cat data/ddl_to_data_schema_05.sql
##> -- df place_enum: word=./data/ddl_to_data_place_enum01.txt
##> CREATE TABLE enum_value (
##> enum_value_id INT PRIMARY KEY
##> -- df: text=place_enum length=1
##> ) ;
##> CREATE TABLE place ( --df: mult=2.0
##> place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , place_enum INT REFERENCES enum_value (enum_value_id)
##> -- df: text=place_enum length=1
##> , address TEXT
##> ) ;
##>
##>
datafiller --size=5 data/ddl_to_data_schema_05.sql > data/ddl_to_data_data_05.sql
cat data/ddl_to_data_data_05.sql | sed -n "/COPY/,+2 p"
##> COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
##> 1
##> 2
##> COPY place (place_id,place_enum,address) FROM STDIN (ENCODING 'utf-8');
##> 1 1 address_4_
##> 2 5 address_8_8_8_8_
This works too. So, we can use the REFERENCES clause to have the database server enforce data integrity between related tables.
v06: Generate Predefined Reference Id File From RDB
RDB is an abbreviation I use for "requirements database": I try to store all software specifications as structured data in a relational database format. For example, I keep the predefined data for my software, such as enum values, inside these RDB files; then I generate their SQL INSERT statements automatically and feed the database during deployment. This kind of data is at a higher knowledge level than other data, so I will use datafiller to generate randomized data for all the other tables.
We want to generate the predefined reference ids from the RDB data of the entity enum_value:
library(magrittr)
evl = tibble::tribble(
  ~enum_value_id, ~enum_value_name, ~enum_var_id,
  1000, "departure depot", 1,
  1001, "return depot", 1,
  1002, "pick stop", 1,
  1003, "drop stop", 1,
  1004, "pick/drop stop", 1,
  1005, "departure/return depot", 1
)
data_entities = list("enum_value" = evl)
get_entity_name_to_id = function(data_entities) {
  by_name_by_id = function(entity_name, data_entities) {
    id_var = sprintf("%s_id", entity_name)
    data_entities[[entity_name]][[id_var]]
  }
  lapply(names(data_entities), by_name_by_id, data_entities) %>%
    setNames(names(data_entities))
  # rutils::lnapply(data_entities, by_name_by_id)
}
m = get_entity_name_to_id(data_entities)
writeLines(as.character(m$enum_value), "./data/ddl_to_data_enum_value_id.txt")
cat data/ddl_to_data_schema_06.sql
##> -- df enum_value_id: word=./data/ddl_to_data_enum_value_id.txt
##> CREATE TABLE enum_value (
##> enum_value_id INT PRIMARY KEY
##> -- df: text=enum_value_id length=1
##> ) ;
##> CREATE TABLE place ( --df: mult=2.0
##> place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , enum_value_id INT REFERENCES enum_value (enum_value_id)
##> -- df: text=enum_value_id length=1
##> , address TEXT
##> ) ;
##>
##>
Note that the word attribute refers to the generated id file: ./data/ddl_to_data_enum_value_id.txt
datafiller --size=5 data/ddl_to_data_schema_06.sql > data/ddl_to_data_data_06.sql
cat data/ddl_to_data_data_06.sql | sed -n "/COPY/,+2 p"
##> COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
##> 1000
##> 1001
##> COPY place (place_id,enum_value_id,address) FROM STDIN (ENCODING 'utf-8');
##> 1 1004 address_8_8_8_8_
##> 2 1003 address_7_
v07: Filter Enum Values by Enum Variables
In my actual software, I define a single table for all nominal (enumerated) values, which I call enum_value. All nominal variables, which I call enum_var, refer to these values. But different nominal variables have different value sets. For example, place_type and weight_unit are two nominal variables whose values are elements of the sets {"factory", "depot", "storehouse"} and {"kg", "ton", "g"}, respectively. I don't want to define a new table for the value set of each enum_var. Therefore, I collect the value sets of all the different enum_vars inside one table, enum_value.
For example, the following dataframe for enum_value contains the values of 3 different enum_var instances.
evl = tibble::tribble(
  ~enum_value_id, ~enum_value_name, ~enum_var_id,
  1000, "departure depot", 1,
  1001, "return depot", 1,
  1002, "pick stop", 1,
  1003, "drop stop", 1,
  1004, "pick/drop stop", 1,
  1005, "departure/return depot", 1,
  2000, "pick/load", 2,
  2001, "drop/unload", 2,
  3000, "depot", 3,
  3001, "factory", 3,
  3002, "customer", 3
)
evr = tibble::tribble(
  ~enum_var_id, ~enum_var_name,
  1, "route_stop_enum",
  2, "action_enum",
  3, "place_enum"
)
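In SQL terms, the two dataframes above correspond to rows like the following. This is a hypothetical sketch: the ids and names mirror the tribbles, while the column lists assume the enum_var and enum_value definitions used elsewhere in this document.

```sql
-- Hypothetical sketch: the evr / evl dataframes expressed as SQL rows.
INSERT INTO enum_var (enum_var_id, enum_var_name) VALUES
  (1, 'route_stop_enum'),
  (2, 'action_enum'),
  (3, 'place_enum');

INSERT INTO enum_value (enum_value_id, enum_value_name, enum_var_id) VALUES
  (1000, 'departure depot', 1),
  (2000, 'pick/load', 2),
  (3000, 'depot', 3);  -- and so on for the remaining values
```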
The place_enum field in the place table should contain only those enum_value instances that belong to the place_enum enum_var.
To enforce this constraint, I will use the dplyr::filter function before generating the different predefined reference id files for datafiller.
get_enum_var_name_to_id = function(evl, evr) {
  get_enum_value_id_by = function(evn, evl_evr) {
    evl_evr %>%
      dplyr::filter(enum_var_name == evn) %>%
      magrittr::extract2("enum_value_id")
  }
  evl_evr = evl %>%
    dplyr::left_join(evr, by = "enum_var_id")
  lapply(unique(evl_evr$enum_var_name), get_enum_value_id_by, evl_evr) %>%
    setNames(unique(evl_evr$enum_var_name))
}
m = get_enum_var_name_to_id(evl, evr)
m
## $route_stop_enum
## [1] 1000 1001 1002 1003 1004 1005
##
## $action_enum
## [1] 2000 2001
##
## $place_enum
## [1] 3000 3001 3002
for (evn in names(m)) {
  writeLines(as.character(m[[evn]]), sprintf("./data/ddl_to_data_%s.txt", evn))
}
writeLines writes the id values of the enum variables into the following files:
ls data/*_enum.txt
##> data/action_enum.txt
##> data/ddl_to_data_action_enum.txt
##> data/ddl_to_data_place_enum.txt
##> data/ddl_to_data_route_stop_enum.txt
##> data/place_enum.txt
##> data/route_stop_enum.txt
cat data/ddl_to_data_schema_07.sql
##> -- df place_enum: word=./data/ddl_to_data_place_enum.txt
##> -- df action_enum: word=./data/ddl_to_data_action_enum.txt
##> CREATE TABLE enum_value (
##> enum_value_id INT PRIMARY KEY
##> -- df: text=place_enum length=1
##> ) ;
##> CREATE TABLE place ( --df: mult=2.0
##> place_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
##> , place_enum INT REFERENCES enum_value (enum_value_id)
##> -- df: text=place_enum length=1
##> , address TEXT
##> , action_enum INT REFERENCES enum_value (enum_value_id)
##> -- df: text=action_enum length=1
##> ) ;
##>
##>
Note that this time there are two different word attributes, referring to two separate reference id files: ./data/ddl_to_data_place_enum.txt and ./data/ddl_to_data_action_enum.txt.
datafiller --size=5 data/ddl_to_data_schema_07.sql > data/ddl_to_data_data_07.sql
cat data/ddl_to_data_data_07.sql | sed -n "/COPY/,+2 p"
##> COPY enum_value (enum_value_id) FROM STDIN (ENCODING 'utf-8');
##> 3000
##> 3001
##> COPY place (place_id,place_enum,address,action_enum) FROM STDIN (ENCODING 'utf-8');
##> 1 3001 address_4_ 2001
##> 2 3000 address_4_ 2001
Note that the place_enum and action_enum values are sampled from their corresponding reference id files.
v08: Ordering of Tables: First Independent Tables Then Dependent Ones
datafiller raises a KeyError when the referenced tables appear in the wrong order.
For example, the following schema definition will give an error because enum_var is referenced before its definition:
CREATE TABLE enum_value (
enum_value_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, enum_var_id INT NOT NULL REFERENCES enum_var (enum_var_id)
, enum_value_name TEXT
) ;
CREATE TABLE enum_var (
enum_var_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, enum_var_name TEXT
) ;
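The fix is simply what the section title suggests: define the independent table first. Below is the same pair of definitions reordered.

```sql
-- Independent table first: enum_var has no foreign keys.
CREATE TABLE enum_var (
enum_var_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, enum_var_name TEXT
) ;
-- Dependent table second: its REFERENCES clause can now be resolved.
CREATE TABLE enum_value (
enum_value_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, enum_var_id INT NOT NULL REFERENCES enum_var (enum_var_id)
, enum_value_name TEXT
) ;
```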
Step 2: Importing Data into PostgreSQL Database
Now we are going to start the PostgreSQL database server and import the generated data. As the database server, I use a Docker container published by https://github.com/subzerocloud/postgrest-starter-kit:
docker start postgreststarterkit_db_1
I put the user credentials for this database as environment variables into .zshrc:
export PGHOST=localhost
export PGPORT=5432
export PGUSER=superuser
export PGDATABASE=app
Now I need to run the files that contain the DDL and COPY SQL commands using the psql tool. I could run those files separately one by one, but instead I made a new SQL file that first initializes the database schema by dropping it, then runs the DDL and COPY files. (Since the connection parameters are already exported as environment variables, psql would also read PGHOST, PGPORT, PGUSER, and PGDATABASE without the explicit -h, -p, -U, and -d flags.)
psql -d app -h localhost -p 5432 -U superuser -f data/ddl_to_data_init_01.sql
cat data/ddl_to_data_init_01.sql
##> drop schema if exists ddl_to_data cascade;
##> create schema ddl_to_data;
##> set search_path = ddl_to_data, public;
##>
##> \ir schema.sql
##> \ir data.sql
##>
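After the import, the result can be spot-checked with a count query. This is a sketch; it assumes schema.sql defines tables from the versions above, e.g. the place table from v07.

```sql
-- Sketch: verify that rows actually landed in the new schema.
set search_path = ddl_to_data, public;
select count(*) from place;  -- should equal --size times the table's mult
```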
v09: Exclude some tables
Data for predefined tables comes from external sources. To exclude these tables from datafiller's generation, use the nogen directive:
CREATE TABLE enum_var ( -- df: nogen