DataRoller



“Roll your own data”

Version 1.0

User Guide

May 2012
http://dataroller.sourceforge.net/



In a nutshell …

DataRoller is a Java application that will insert piles of good looking data into your favorite databases so you can move in and be happy.







Rich Alberth





Table of Contents

1 For The Impatient
2 Overview
3 Project Files
3.1 Comments
3.2 Labels
3.3 Tables
3.4 Deleting Old Rows
3.5 Parent Child Relationships
3.5.1 Where Clause
3.6 Unique and Unique Per Parent
3.6.1 Data Exhaustion
3.6.2 Example
3.6.3 Independently Unique
3.6.4 Sequences
3.7 Columns
3.8 Data Types
3.8.1 Strings
3.8.2 Integers
3.8.3 Decimals
3.8.4 Floats
3.8.5 Dates and Times
3.8.6 Booleans
3.8.7 Nulls
3.8.8 Unsupported Data Types
3.8.9 Null Expressions
3.9 Variables
3.9.1 Assignment
3.9.2 Variables and randomrow()
3.10 Embedded SQL
3.10.1 Embed SQL in DataRoller File
3.10.2 Reference External SQL File
3.11 A Note on File and Folder Names
3.12 Lookup Tables
4 Generators
4.1 Random Data
4.1.1 Random Integers
4.1.2 Random Decimals
4.1.3 Random Floats
4.1.4 Random Dates and Timestamps
4.1.5 Choice
4.1.6 BLOB
4.2 Structured Data
4.2.1 Lorem Ipsum
4.2.2 Sequences
4.3 Column
4.3.1 Current-Row Reference
4.3.2 Separate Table Reference
4.3.3 Parent Table Reference
4.4 Lookup Data
4.4.1 File Row Lookup
4.4.2 Folder Contents Lookup
4.4.3 XML File Lookup
4.4.4 randomrow()
4.4.5 Previous Row
4.5 SQL
4.6 Operators
4.6.1 Types, Nulls and No Short-Circuit Logic
4.6.2 Boolean Operators
4.6.3 Equality Operators
4.6.4 Comparison Operators
4.6.5 Algebraic Operators
4.6.6 String Operators
4.6.7 Dates
4.6.8 Operator Summary Table
4.7 Conditionals
4.7.1 If then else
4.7.2 Case When
4.7.3 Conditionals Example
4.8 Raw()
5 Functions
5.1 String Functions
5.1.1 Function pattern()
5.1.2 Function guid()
5.2 Numeric Functions
5.2.1 Integral Functions
5.2.2 Floating-point Functions
5.3 Date and Timestamp Functions
5.4 System Functions
5.5 Cryptographic Functions
5.6 Data-Type Conversion Functions
6 Execution
6.1 Command-line Switches
6.2 User Preferences and Aliases
6.3 Loading JDBC Drivers
6.3.1 SQL Server
6.3.2 Oracle
6.3.3 DB2
7 Speeding up DataRoller
7.1 DataRoller Generator Relative Costs
7.2 Rebuild Indexes
7.3 Regenerate Statistics
7.4 Lock Tables
7.5 Disable Costly Constraints
8 Extending DataRoller
8.1 User-Supplied Functions
8.2 Java Function
8.3 Arguments and Return Types
8.4 Function Alias
8.5 Function Signature
8.6 Invocation and Execution
8.7 For Example
9 Tips & Tricks
9.1 Using Delete for Row Partitioning
9.2 Dealing with Artificial Primary Keys
9.2.1 MySQL
9.2.2 SQL Server
9.2.3 Oracle Sequences
9.3 Avoid Querying Unneeded Data
9.4 Mutually Unique randomrow()
10 For Reference
10.1 Project Syntax Reference
10.2 DataRoller Keywords
10.3 Syntax Reference
10.3.1 Lexical Elements
10.3.2 Grammar
10.4 JDBC URL Reference
10.5 DataRoller License and Included Works



1 For The Impatient

DataRoller inserts rows into a database based on a file describing your tables and columns.

Execute this against MySQL:

create database dr;
use dr;

create table invoices (
   invoice_id     int,
   alt_key_uid    varchar(50),
   name           varchar(80),
   label          varchar(30),
   status         char(3)
);

Put this in a file called basic.txt:

table invoices
   insert 4 rows
{
   invoice_id    sequence(),
   alt_key_uid   guid(),
   name          lorem(20..80),
   label         pattern("UU-NNNN"),
   status        choice("UNK","SHP","CLS",
                        "OPN","INP")
}

This file defines how DataRoller should go about populating your tables and columns, such as how many rows to insert and what value to put into each column. Sequence() means 1, then 2, then 3, and so on. Read http://lipsum.com for details on lorem(). Pattern() generates a string based on the codes you pass in: “U” for an upper-case letter, “N” for a digit (0-9). Choice() just picks one of the supplied values at random.

Download and unzip DataRoller, open a command-window and type this:

dataroller -c jdbc:mysql://localhost/DRTEST -u ROOT -p mypasswd basic.txt

You’ll see this if you have everything installed and running correctly:

DataRoller 1.0

This program comes with ABSOLUTELY NO WARRANTY, is free software, and you
are welcome to redistribute it under certain conditions. See License.txt
for details. Copyright 2012 Rich Alberth.

Invoices                      [....................] 0s
Done (0s)

Run “select * from invoices”:

+------------+------------------------------------+------------+---------+--------+
| invoice_id | alt_key_uid                        | name       | label   | status |
+------------+------------------------------------+------------+---------+--------+
|          1 | 03153499-f392-46a8-aa69-f9ff225971 | Lacus torq | XN-5814 | INP    |
|          2 | afbc0751-ed8b-4773-9678-4bbf84aef7 | Dui maecen | RK-2669 | INP    |
|          4 | 2686521e-cc37-4471-99ec-a2625c6d13 | Lobortis f | ZR-3740 | SHP    |
|          5 | b8a73acc-a376-441e-bff2-9f8567f52d | Amet in no | DC-1536 | OPN    |
+------------+------------------------------------+------------+---------+--------+
4 rows in set (0.00 sec)

That’s it! You write a text file with instructions on what should go into each table and column and execute it via a command-line tool. The input text file supports variables, complicated table relationships, user-supplied functions and much more.

2 Overview

DataRoller (motto “Roll your own data!”) is a Java application that will insert piles of good looking data into your favorite databases so you can move in and be happy.

Here are some good ideas:

Load up thousands of rows of data, fire up your app and see what areas need tuning.
Databases execute your SQL SELECT statements differently based on how much data is in each table. Real data means real tuning.

Generate a bunch of values that look just like your production database for your next demo.
Ever say this to a crowd of people “sorry this isn’t how the app will look in production, it’s just test data”?

Throw some stand-in data together quickly for your testing team.
Testers can’t find subtle bugs when they test a screen with only a single record they generated.

Clear out all the rows in your database and reset it with some basic starting data.
After a demo, how do you “reset” your box back to a developer mode?

Rapid prototyping and requirements gathering.
Fill up a new table with good-looking sample data, use a rapid prototyping tool to generate a skeleton UI and demo it to the users during a requirements meeting.

3 Project Files

A project file is the main input to DataRoller. This file documents every table that needs populating, what to put in each column, and how tables relate to each other. The syntax feels a little like SQL “create table” statements: the tables are listed in the file in the order in which they should be populated. Within each table section, each column is listed with how data should be generated for it.

A guided tour through your first Project File:

table invoices


List each table you want populated, in the order you want them populated in.

      insert 2000 rows


Number of rows to insert (duh). There are a lot of things you can add later in this section between the table name and the “{” for column definitions!

{


Braces group columns per table.

   invoice_id  sequence(),


“sequence()” means start at 1, increment each time a new row is inserted.

   name        lorem(20..80),


Generate Lorem Ipsum text between 20 and 80 characters in length.

   status   choice("UNK","SHP","CLS"),


Pick one of these supplied values at random.

   delivery    random(D'2009-4-1' ..
                      D'today'
                      step /hour/),


Pick a random date between April 1st 2009 and today, with a time component rounded to the nearest hour (minutes and seconds are always zero).

   last_name   filerow("names.txt"),


“filerow()” means pick a random line from the text file listed (stripping line termination chars).

   ship_code   pattern("UU-NNNN"),


Generate a string based on the pattern where “U” is an upper-case letter and “N” is a number (digit).

   ship_label  folder("labels.zip")
}


Pick a random file from the given folder, read it in, and use the contents of the file as the value for this column (good for “blob” and “text” columns).

table invoice_items

   child of invoices on
      this.inv_id = parent.invoice_id

   insert 2..15 rows
{


Child table: generate between 2 and 15 rows in invoice_items for every row generated in table invoices. DataRoller makes the foreign keys work out the right way. There are no invoices with no invoice_items, and no invoices with thousands of items by accident!

   item_id     sequence(10000),


Start these values way above invoice_id values just to keep them apart.

   product     column(products.product_id),


Look-up table: pick a random value from table products, column product_id.

   quantity    random(1..100),


Pick a random integer between 1 and 100.

   price       random(1.00 .. 100.25
                      step 0.25)



}


Pick a random decimal (not a real/float/double, this is specific precision) between 1 and 100.25, divisible by 0.25. I.e., the price always ends in “.00”, “.25”, “.5” or “.75” (quarters).



The rest of this chapter covers the language elements in a project file.

3.1 Comments

Comments: two slashes (“//”), two dashes (“--”), or C-style /*…*/

table mytbl             // really an insertable view
       delete all
       insert 0 rows    /* just clear it out */ {
   mycol sequence()     -- no args to sequence() means start at 1
}

3.2 Labels

Labels are used to name tables and columns.

Note that there is no rule in DataRoller that a label starts with an alphabetic character. Both “12345g” and “__#mytmp6” are valid labels. This is useful for temporary tables that start with “#”.
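For illustration, both of those labels can appear directly in a table definition; a minimal sketch (the table and column names here are made up):

table __#mytmp6
   insert 5 rows
{
   12345g   sequence()
}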

3.3 Tables

A DataRoller project file consists of a sequence of table definitions, with some extra, optional declarations at the top of the file.

A Table definition is made up of:

  1. “table” and the name of the table to insert into. A database table can be listed more than once in a DataRoller project file.

  2. Table prologue: a set of table-level things, such as:

    1. Whether to delete everything before inserting new rows

    2. How many rows to insert

    3. Whether this is a child table of another definition in the project file.

  3. Column definitions, each having:

    1. Name of the column to insert into, and

    2. An expression telling DataRoller how to construct values to insert.

Separate Column Definitions in a table with a comma.

The order that table definitions appear in the file is the order in which they are processed and inserted. If you list table products before invoices, and invoices references products, you’re OK: you’ll have data in products before you try to insert rows into invoices.
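For example, a minimal sketch of that ordering (the column names are illustrative): products is listed first so its rows already exist when invoices picks values from it with column():

table products
   insert 50 rows
{
   product_id   sequence(),
   name         lorem(10..40)
}

table invoices
   insert 200 rows
{
   invoice_id   sequence(),
   product_id   column(products.product_id)
}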



Anatomy of a simple table definition:

table invoices


      delete all

Optionally include the keywords “delete all” to have DataRoller remove all existing rows before inserting new ones.

      insert 1000 rows

Number of rows to insert, as a fixed amount, or a range of integers. Zero is valid.

{
    pk  sequence()
}

List all column definitions within curly braces, separated by commas as in SQL “create table” statements.

3.4 Deleting Old Rows

“delete all” in the example above is the simplest way to clear out old data before having DataRoller insert fancy new data. This removes all records from the target table.

DataRoller is meant to be run with only the most basic access to the target tables it operates over. The SQL “truncate” command is much faster than deleting rows from a table, but sometimes requires more rights to the table than “delete”. DataRoller is designed to require no special permissions on the database other than basic select, insert, and delete.

DataRoller is sensitive to the relationships between tables. If products has a foreign key to vendors, you would list vendors earlier in the DataRoller Project file than products: DataRoller should populate the vendors table first, and then populate the products table.

DataRoller does not issue delete statements following this same logic: when executing a DataRoller Project file, the DataRoller Engine works in two passes: it first issues every table’s delete statement in reverse order of the table definitions, and then inserts rows into each table in the order listed.

This way, if you have foreign keys referencing tables, you will see the following:

  1. Delete from dependent tables

  2. Delete from parent tables

  3. Insert into parent tables

  4. Insert into dependent tables

This processing is not obvious from the syntax of the DataRoller Project file. It was decided to put the “delete” clause inside the table definition so all facts about a table are defined together in a single location in the file. This avoids a lot of scrolling around while writing a DataRoller Project file. The DataRoller Project file is declarative: you specify what you want the results to be, and DataRoller will make it happen.
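For example, given the following minimal sketch (the table and column names are illustrative), DataRoller deletes from order_items first, then orders, and only then inserts into orders followed by order_items, even though each “delete all” is written inside its own table definition:

table orders
   delete all
   insert 100 rows
{
   order_id   sequence()
}

table order_items
   delete all
   child of orders on this.order_id = parent.order_id
   insert 1..5 rows
{
   item_id   sequence()
}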

The delete statement has two forms:

  1. “delete all”, which issues a plain “delete from ___” SQL statement

  2. “delete where” followed by a string literal, which issues the same SQL as above with the string literal appended as a “where” clause.

table foo
   delete where "bid < 7000"
{
   ...
}

This example will issue the SQL statement “delete from foo where bid < 7000”. Note that the argument to the “delete where” clause is a string, and not arbitrary SQL. The entire example above is not actual SQL: “delete where” are two DataRoller keywords chosen to look a lot like SQL as a memory aid. The thing after “where” must be a DataRoller string literal. DataRoller assembles the SQL delete statement based on these components.

Because only a simple string literal is allowed, you cannot use string concatenation (“+”), call functions, or use generators or any other DataRoller facility in the “delete where” clause.

3.5 Parent Child Relationships

In database parlance, there is only the single “foreign key” concept to handle all design patterns. Database designers use higher-level concepts to capture relationships.

Examples: a reference (lookup) association, such as an invoice line that points at a row in a products table, and a composition (parent-child ownership) association, such as invoice line items that belong to exactly one invoice.

DataRoller treats these two types of association differently within a project file. References are handled on a per-column basis, since they just capture picking a random value from the lookup table and using it in the current row being generated.

Composition is represented in a DataRoller Project file with a clause at the table level to show how the parent and child tables relate.

Anatomy of a parent child relationship:

child of table on this.fk = parent.pk

“table” is the name of the parent table, “fk” is the column in the child table with the references constraint on it (the foreign key), and “pk” is the primary key column in the parent table. Note that “this” and “parent” are keywords in DataRoller and are included to make the line more readable, mimicking the syntax of a typical SQL JOIN clause.

Example of a child table definition:

table invoice_items
   delete all
   child of invoices on this.invoice_id = parent.invoice_id
   insert 1..30 rows
{
   ...
}



When specifying a child table in a parent-child relationship, you can specify a range of rows to generate (“insert 1..30 rows” in sample above). The range (or single value) is used by DataRoller to decide how many child rows to insert per parent row. In the above example, if the invoices table section has “insert 10 rows”, then the invoice_items table will end up with between 10 and 300 rows total (1×10 .. 30×10).

3.5.1 Where Clause

The “child” clause also supports passing a SQL “where” clause to pick what set of rows should be affected. In the above example, all rows from invoice_items are removed, and then every row from invoices receives between 1 and 30 rows in invoice_items. This is a comprehensive action: everything removed, all parent rows used.

Using a “where” clause on the “child” portion will affect which rows from the parent table are used to attach child rows. Consider a database with a table of hospital patients, and a child table holding a record for all medications currently prescribed. Patients that are active (in the hospital) have between zero and 6 medications. Patients that have been discharged always have zero. The database should never have a discharged patient with meds:

table patients
   insert 800 rows
{
   patient_id       sequence(),
   first_name       filerow("classpath:firstnames.txt"),
   last_name        filerow("classpath:lastnames.txt"),
   status           choice("ACTIVE", "DISCHARGED")
}

table prescriptions
   child of patients on this.patient_id = parent.patient_id
                     where "status = 'ACTIVE'"
   insert 0..6 rows
{
   rx_code          pattern("UU-NN-UUUUU-NN")
}

Internally, DataRoller simply issues “select patient_id from patients prnt” when starting to populate the prescriptions table since it has a “child” clause. It loops over these records and inserts between 0 and 6 rows per patient_id returned. When a “where” clause is included like above, it appends it to the query.

In the above example, DataRoller executes “select patient_id from patients prnt where status = ‘ACTIVE’”, and then generates between 0 and 6 rows for every returned patient_id.

The “where” clause text is any arbitrary clause that can be put into a select statement. The “from” that is in effect is the parent table with table alias “prnt” (check out the previous paragraph – it’s used there). The table alias “prnt” is included for “where” clauses that contain sub-selects or otherwise need access to the parent table being queried. For example, the following table only adds maintenance records for airplanes with over 10,000 scheduled kilometers.

table airplanes
   . . .
{
   plane_id     . . .
}

table maintenance
   child of airplanes
      on this.plane_id = parent.plane_id
      where "10000 < (select sum(distance)
                      from   routes r
                      where  r.plane_id = prnt.plane_id)"
   insert 1 row
{
   . . .
}

3.6 Unique and Unique Per Parent

Random data by itself will serve for basic testing and simple scripts, but is insufficient for demonstration data and for portions of the database where bad data will cause application logic to fail. For example, a unique database constraint will reject duplicate random values, and an application may assume that a lookup value appears at most once per parent row (see the Team example below).

Clauses “unique” and “unique per parent” can be added to certain generators below to prevent DataRoller from generating the same value more than once. For tables that have a “child of” clause, “unique” will never generate the same value more than once across every row in the table (regardless of parent row). “Unique per parent” will never generate the same value for all rows in the child table for a particular parent value. “Unique per parent” may generate the same value for more than one row, but only if such rows have different parent rows.


                                  Table without “child of”           Table with “child of”

random(1..10)                     May generate duplicates            May generate duplicates
random(1..10 unique)              No duplicates at all               No duplicates at all
random(1..10 unique per parent)   Error (“per parent” not allowed    No duplicates within each group of
                                  without “child of”)                rows with the same parent row



All Generators that support uniqueness (see below for “implicit” discussion):

Generator                  Supports “unique”                    Supports “unique per parent”

random(integers)           Yes                                  Yes
random(decimals)           No                                   No
random(floating-point)     No                                   No
random(dates)              Yes                                  Yes
sequence() (all types)     implicit                             Yes
column(mytbl.mycol)        Yes                                  Yes
mytbl.mycol                No: must use column() syntax to use unique clauses
column(parent.mycol)       No: refers to parent row of a table with “child of” clause
column(mycol), mycol       No: refers to column in the current row of the current table
folder()                   Yes                                  Yes
filerow()                  Yes                                  Yes
xpath()                    Yes                                  Yes
randomrow()                Yes                                  Yes

3.6.1 Data Exhaustion

Be careful when using “unique” and “unique per parent”: if DataRoller does not have enough values to supply to the number of rows requested, DataRoller will issue an error message and stop.

table foo
   insert 1..10 rows
{
   . . .
}

table bar
   child of foo on this.id = parent.id
   insert 5..7 rows
{
   code   random(1..10)
}


There will never be a problem with column bar.code: since there is nothing unique about it, the 10 possible values may be used over and over again.

table foo
   insert 1..10 rows
{
   . . .
}

table bar
   child of foo on this.id = parent.id
   insert 5..7 rows
{
   code   random(1..10 unique)
}


Table foo will only have between one and ten rows total.

Table bar will have between five and seven rows per row in foo, so it will contain between 1×5 = 5 and 10×7 = 70 rows in total. The random() generator does not have enough possible values to serve more than 10 rows.

table foo
   insert 1..10 rows
{
   . . .
}

table bar
   child of foo on this.id = parent.id
   insert 5..7 rows
{
   code   random(1..10 unique per parent)
}


Table bar may have up to seventy rows, but they are grouped into groups of five to seven. The ten available values in random() are more than enough to handle every group. DataRoller will not run out of possible values since it can reuse values across rows in foo.

3.6.2 Example

These rules apply for all generators below that support “unique” and “unique per parent”. For example, a company that leases large construction equipment has several repair Teams that fix broken equipment in the field. Each Team has a mix of labor categories on it. A Team does not have more than one person with the same labor category. Each Team is assigned a single Truck to haul its equipment. Obviously, the same Truck cannot be assigned to more than one Team.

table trucks
   insert 15 rows
{
   truck_id  sequence(1)
   . . .
}

table teams     // no need to make this a child of trucks:
                // some trucks are just not used

   insert 10 rows
{
   team_id    sequence(),
   truck_id   column(trucks.truck_id unique)
}

table team_members
   child of teams on this.team_id = parent.team_id
   insert 1..5 rows
{
   employee_id     column(employees.emp_id unique),
   labor_category  filerow("categories.txt" unique per parent)
}

3.6.3 Independently Unique

Clauses “unique” and “unique per parent” apply only to the generator on which they are applied, even if two such generators appear on the same column or multiple columns on the same table. For example:

table code_parts
   insert 100 rows
{
   part     filerow("parts.txt" unique)
}

table codes
   insert 50 rows
{
   code     column(code_parts.part unique) + "-" +
            column(code_parts.part unique)
}

The generator “column(code_parts.part unique)” appears twice in table codes, both for column code. The fact that they are both there and both use “unique” does not mean that they are collectively unique: if code_parts contains “ABC” for column part, by chance DataRoller may generate “ABC-ABC” as a valid value for column code. All generators with “unique” are independently unique: they don’t talk to each other.

3.6.4 Sequences

Sequences always generate unique values, and so do not support “unique” explicitly (you can’t put “unique” on them: they are already unique). There are no non-unique sequences such as a “wrap-around” sequence that might start over from 1 after hitting a maximum value.

Sequences do support the “unique per parent” clause. A “unique per parent” sequence will start over and reset itself back to its starting value for each new parent row.

Consider a company that wants to generate purchase numbers based on the supplier’s code. The first P.O. for each supplier has a “1”, then a “2”, and so on. Any P.O. number will therefore have the supplier it applies to, and a sequence showing which P.O. it is for that supplier.

table suppliers
   insert 80 rows
{
   label  "S" + format(sequence(1), "00")
}

table purchase_orders
   child of suppliers on this.id = parent.id
   insert 0..10 rows
{
   id      sequence(1),
   po_num  parent.label + "-" +
           format(sequence(1),
                  "000")
}

Generates P.O. numbers:

S01-01
S01-02
S01-03

S02-04   
we wanted “-01” here!
S02-05
S02-06

S03-07   
we wanted “-01” here!
S03-08
S03-09

Not what we wanted.

table suppliers
   insert 80 rows
{
   label  "S" + format(sequence(1), "00")
}

table purchase_orders
   child of suppliers on this.id = parent.id
   insert 0..10 rows
{
   id      sequence(1),
   po_num  parent.label + "-" +
           format(sequence(1 unique per parent),
                  "000")
}

Generates P.O. numbers:

S01-01
S01-02
S01-03

S02-01
S02-02
S02-03

S03-01
S03-02
S03-03

Each supplier gets P.O. numbers with their label, starting at 1.

3.7 Columns

After a table definition (“table invoices delete all insert 100 rows”) are a collection of zero or more Column Definitions separated by commas, all within one set of curly braces (“{” and “}”). Each Column Definition has the name of a column to populate with data, and an expression that describes how DataRoller should generate data for that column. DataRoller Variables can be set as well, but see the Variables section below for details.

A sample Column Definition:

ncd_code          "AB-" + pattern("UUUNNNNNN") + format(sequence(1), "000")

“ncd_code” is the name of the column and everything else is the Generator. The column name does not appear in double-quotes, but you can surround it with “[” and “]” if your column name is the same as a DataRoller keyword (like “sequence”).
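For example, a minimal sketch of escaping a column that happens to be named the same as the keyword “sequence” (the table and other column names are made up):

table audit_events
   insert 50 rows
{
   [sequence]   sequence(),
   summary      lorem(20..60)
}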

“Generator” is a broad term in DataRoller for anything that can generate data suitable for inserting into the database. Each Generator can produce dates, strings, or numbers, either at random or following more organized logic. A Generator Expression is a bunch of Generators combined with things like “+” for string concatenation or numerical addition.

Column values can be broken down into two groups: Generator Expressions, whose values are bound into the insert statement at runtime, and raw() values, which are placed directly into the SQL text.



DataRoller constructs a SQL insert statement for each table as follows:

  1. Each column in the DataRoller file is added as a column name in the SQL insert statement. Note that the local foreign key to a parent table is also included in the insert statement.


  2. Each column contributes one “?” to the values clause of the SQL insert statement. All generators and parent table definitions contribute a “?” that is bound on a per-row basis to whatever the Generator generates. Raw() is the exception: the argument to raw() is used directly in the SQL insert statement instead of a “?” for later binding.



Because raw() is used to construct the SQL insert statement directly, raw() cannot be combined with any Generator or appear in any fashion other than by itself as the only portion of the column.
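For example, a sketch of letting the database itself supply a value, assuming raw() is given the SQL text as a string literal as described above and that the target database understands current_timestamp (the table and column names are made up):

table audit_log
   insert 100 rows
{
   log_id       sequence(),
   created_on   raw("current_timestamp")
}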

3.8 Data Types

3.8.1 Strings

Strings are any sequence of characters surrounded by double quotes. The usual escapes are supported:

Escape       Value   Description

\\           \       Backslash escapes the next char, another backslash.
\r                   ASCII 0x0D, the carriage return (non-printable).
\n                   Newline.
\t                   Horizontal tab. There is no “\v” available for vertical tab.
                     Does anyone use vertical tabs anymore?
\"           "       Escape the double-quote. There is no “\'” to escape a single
                     quote. Use the single quote directly in the string.
\u00a7       §       Unicode characters can be included with “\u” followed by
\u2264       ≤       exactly four hex digits (case-insensitive).

Newlines (carriage returns and linefeeds) are valid characters in their own right, so you don’t have to use “\n” if you don’t want to. Be careful, because this means forgetting to put a closing double-quote at the end of a line does not generate an error. All text until the next double-quote character will be included in the string, leading to unexpected errors flagged nowhere near where the root cause is!

String Literal              Exact Value            Comments

"Hello World!"              Hello World!           Simplest string, nothing fancy going on.

"line1\nline2\nline3"       line1                  The “\n” put actual newlines into the
                            line2                  string literal.
                            line3

"line1                      line1                  A string may span multiple lines for
line2                       line2                  embedding long SQL, for example.
line3"                      line3                  Newlines are preserved.

"_'thing1'_\"thing2\"_"     _'thing1'_"thing2"_    Escaped double quotes. Single quotes do
                                                   not need to be escaped.

"Use \"\\t\" for tab"       Use "\t" for tab       \\ is a single \ and the t is just a t.

3.8.2 Integers

Integral numbers are any whole number, negative, zero or positive. Internally, all integral numbers are stored as Java longs, so you have an effective range of a negative number of 19 digits to a positive number of 19 digits.

By default, all integer literals are assumed to be expressed in base 10 notation. Prefix a literal with “0x” to use hex (0-9 and A-F). Octal is supported by appending “_8” and binary is supported by appending “_2”. The “_” is reminiscent of the subscript notation from the typesetting language TeX. There are no other bases supported with the “_” notation currently. Note that “0123” is a decimal number, not octal in this scheme!

Integer Literal   Base 10 Value   Description

123               123             Numbers are base-10 by default.
-560000           -560000         Negative numbers as expected.
0xFf              255             “0x” is the hex prefix. “F” and “f” are both 15.
-0xa8             -168            Negation comes before the “0x”.
123_8             83              123₈ = 83₁₀
-123_8            -83             Negatives are the same as base-10 above.
1010_2            10              1010₂ = 10₁₀
-1010_2           -10             Negatives are the same as base-10 above.
Note that Decimals and Floats below are only expressible in base-10 notation.

3.8.3 Decimals

Decimals are real numbers with a fixed number of decimal places (no repeating decimals like 0.3333333…). Decimals are expressed as 1 or more digits, a period and one or more digits, such as “1.0” or “-559.0021837”. Decimals are used for database columns like DECIMAL(10,5) and NUMBER(9,2): those types that have perfect precision and no round-off errors like typical floating point numbers.

Be careful with Decimals! They have no ability to represent irrational numbers or numbers with repeating decimal digits like the number “one third” (1÷3). The number “1.3333…” is not a Decimal.

Decimals are internally represented by a 32-bit integer holding the mantissa and a 32-bit integer scale that holds where the decimal point is in relation to the whole number mantissa. The advantage of Decimals is that they do not suffer from rounding or inexact mathematics. Ever add 0.000001 to 0.000002 and get 0.000002999999999 with IEEE floating point types? …not with Decimals!

3.8.4 Floats

Floats differ from Decimals in that they directly correspond to the internal floating point types of the Java language and the underlying operating system. Floats are only applicable for database types such as “float”, “real” and “double”. Using a Decimal DataRoller type for a database double column will result in potential rounding problems. Using a Float DataRoller type with a database DECIMAL column may result in over/underflows and inexact results.

Floats are represented using the scientific notation of a mantissa, “e” and the power-of-10 exponent, either of which may be negative. The use of the “e” notation is the defining syntax for Floats. Examples: “0e0”, “0.0000005e189”, “1231321.77e-90”.

Mathematically, n × 10⁰ = n, so if you don’t like scientific notation, just add “e0” to the end of a number to make it a floating point literal. I.e., write 123.456 as a float in a DataRoller file as “123.456e0”.

3.8.5 Dates and Times

Date literals start with D' and end with a ' character. Note that double-quotes are not used (they delimit strings). Values within the delimiters generally follow the ISO date and date-time structure of “yyyy-mm-dd”. Timestamps use the date format with “T” and the time in “hours:minutes” and optional “:seconds”. Hours are in 24-hour notation.

In addition to the above, the keywords today, yesterday, tomorrow and now are valid as a Date and Timestamp with the same delimiters as above. D’now’ is literally this very second. D’today’ is the date portion of D’now’. If you use today for a database column that supports a time component, the time component is midnight, meaning today always predates now except at exactly midnight on the wall clock, when they are equal.

There currently is no time-only data type supported.

Examples:
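A few representative literals based on the rules above:

D'2012-5-14'             May 14, 2012 (the time component is midnight)
D'2012-5-14T16:30'       May 14, 2012 at 4:30 PM
D'2012-5-14T16:30:45'    The same, with seconds
D'now'                   This very second
D'yesterday'             Yesterday at midnight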

3.8.6 Booleans

Valid Boolean constants: “true”, “false”.

Example:

Given a PostgreSQL table definition:

create table abc
(
    foo   Boolean,
    bar   Boolean
)

fill it with values via:

table abc
{
    foo   true,
    bar   false
}

Note that true and false are not enclosed in quotes: they are not strings, they are constant values of type Boolean in DataRoller.

3.8.7 Nulls

Not terribly interesting, but “null” is a keyword in the language and will insert a NULL database value when used explicitly.

3.8.8 Unsupported Data Types

There is no “bit”, enumeration or MAC address or other data types within DataRoller. This does not mean that you cannot store them into a database. JDBC drivers and databases themselves are flexible in converting types when they are unambiguous.

For example, you can do this to a macaddr column and the database will convert the string into the internal representation:

insert into mytable (mymac) values ('08:00:2b:01:02:03')

There are two approaches to handle unusual circumstances, such as populating an Oracle Spatial database with columns of type SDO_GEOMETRY:

  1. Write or use an existing stored procedure in the database to convert from a simple data type above, into the database-specific internal representation. For Oracle above, the PL/SQL function SDO_GEOMETRY(varchar2, number) already exists.

  2. Use a User-Supplied Java function in DataRoller to accept simple types from above, and return a vendor-specific internal object, such as an object of type oracle.sql.ARRAY for structured types in Oracle.

3.8.9 Null Expressions

Each Column Expression has an optional Suffix, which is of the form “null ___%” where “___” is an integer between 0 and 100. “null 5%” means approximately 5 out of every 100 rows generated by DataRoller will be NULL values. “null 0%” is the same as not specifying the suffix at all, and “null 100%” is the same as omitting the entire Column Expression.

DataRoller uses a simple random number generator to decide when to leave a column null. This means for small values of “insert ___ rows” on a table, DataRoller will not generate the correct percentage of null values.
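For example, a sketch that leaves roughly a third of the generated rows with no middle name (the file names are illustrative):

table people
   insert 1000 rows
{
   first_name    filerow("firstnames.txt"),
   middle_name   filerow("firstnames.txt") null 33%,
   last_name     filerow("lastnames.txt")
}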

3.9 Variables

Variables in DataRoller are un-typed places to store pre-computed values for use elsewhere in a DataRoller project file. To declare a variable, just assign it a value – there are no separate declaration statements. To reference a variable, just mention it. References to variables that have not yet been assigned are taken to have a value of null.

Variable names start with a dollar symbol (“$”) and contain one or more letters or numbers. “$12345”, “$abcde” and “$c3p0” are all valid variable names. Labels used to name columns may also contain letters, numbers and optional dollar signs, so a column name that starts with a dollar sign must be escaped using the square bracket syntax: “$abc” is a variable name, regardless of where it appears. “[$abc]” is a label (such as a column name) regardless of where it appears.

3.9.1 Assignment

Assign values using the equals sign (“=”) anywhere within a table’s columns. DataRoller will evaluate each column’s values in the sequence they are listed in the project file, including the occurrences of variable assignments. DataRoller re-computes each column and variable assignment for each row it generates.

For example,

table testVars
   insert 3 rows
{
   $n = sequence(5),   col1   $n,
   $n = $n + 1,        col2   $n,
   $n = $n + 1,        col3   $n
}

will produce the following rows in table testVars:

col1   col2   col3

5      6      7
6      7      8
7      8      9

The first assignment to $n is sequence(5), which will hand out numbers 5, 6, 7 and so on. Sequence() is unrelated to any other assignment in this table: DataRoller simply returns one more than the last value sequence() returned from the last row it generated.



Anything you can assign to a column can be assigned to a variable, including the “null 10%” syntax at the end. This includes expressions like “case … when … else”, any function, any operator such as “+” or “*”, or any reference to any variable (including the one on the left-hand side of the assignment).

All of the following are valid variable assignments:

Assignment

Description

$a = $b null 50%

Sometimes whatever is in variable $b, sometimes null.

$a = pattern("B" * random(1..6))


$a = products.prod_id

Picks random values from table products column prod_id

$a = name != "root" && group != "wheel"

References columns name and group in the current row being generated

$a = if (haspreviousrow())
        "DEFAULT"
     else
        column(policies.label unique per parent)
     end

Assuming this is in a table with a parent, this will pick a value for $a of “DEFAULT” for the first row per parent table row, and then pick a random policy label (unique) for each subsequent record per parent row.



Variables hold a single, scalar value until they are assigned another value. There is no local scope to any variable, and they have an extent from when they are first assigned to the end of the entire DataRoller project file. Put another way, variables are global across table sections and hold their values from table to table.

For example, the following will always put a new value one greater than the previous into table tree, even though this table is included twice in the project file.

table tree
   insert 4..9 rows
{
   $seq = sequence(),
   id      $seq
}

table tree
   insert 10..20 rows
{
   $seq = $seq + 1, // has last value used in table above to start!
   id        $seq
}

3.9.2 Variables and randomrow()

The generator randomrow() picks a row at random from the table (or lookup table) specified. The actual value it returns is a DataRoller internal type that is generally not useful other than with the special variable syntax below. Assign a variable to randomrow(), and then reference columns from this random row via the “.” syntax:

table myproducts
   child of person on this.person_id = parent.id
   insert 5 rows
{
   $prod = randomrow(products),

   prod_id      $prod.id,
   prod_name    $prod.name,
   prod_price   $prod.price
}

DataRoller will pick a row at random from table products and store all columns in the $prod variable during the assignment statement. After this assignment statement, the variable $prod contains these column values until $prod is assigned another value (like when the next row in table myproducts is generated). This means that each use of variable $prod above (like $prod.id) will refer to columns from the same row in table products. This is a great way to coordinate values between two tables, such as in associative tables.

The “bar” in the “$foo.bar” syntax is a column label like any other. Escape a column that contains special characters using square brackets, such as “$foo.[$bar]” to reference column “$bar” in variable “$foo”. Be careful to include the “.” to separate the two portions. “$foo[bar]” is not a valid syntax in DataRoller: square brackets are used to escape labels, not indicate array reference.

The “$foo.bar” syntax cannot appear on the left-hand side of an assignment: DataRoller does not allow assignment into a random row picked via randomrow() via “$foo.bar = 123”.

3.10 Embedded SQL

Operational databases are often constructed to help the application maintain valid data “state” through things like check constraints and triggers. Any modification to data in the database outside the normal range of operation expected by the application can lead to real problems. For example, having triggers on table “orders” that insert history records into table “orders_hist” are effective, as long as the target application performs normal business modifications to the orders table.

There is no expectation that these triggers maintain the history when “delete from orders” executes. DataRoller can perform disruptive, wide-spread changes to a database outside the bounds of what application triggers were designed for.

Before running DataRoller, it is a good idea to disable or remove these types of database-resident code so the database is “opened up” to non-operational access. When DataRoller is complete, put the triggers back.



In a similar vein, deleting all data from multiple tables and reloading them with a data set of an entirely different character will render a database inefficient: table and column statistics are not always updated along with data changes, and a table that does not match its pre-determined statistics will likely produce inefficient queries.

To wit, an orders table with a million rows and a products table with 100 rows will likely start an inner-join on the products table first, and then perform index unique searches on the orders table. If DataRoller removes all records from products and orders, and then loads up a million products and only a few orders, inner joins will still start on the products table taking up dramatically more resources than needed.

Use DataRoller’s embedded SQL to regenerate statistics on tables after new data has been loaded.
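For example, on MySQL something like the following could be embedded after the table sections that load the new data (the “sql” keyword is covered in the next section; the exact statement varies by database vendor):

sql "analyze table products"
sql "analyze table orders"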

3.10.1 Embed SQL in DataRoller File

The method to perform the above is by embedding SQL statements to execute as part of the DataRoller Project execution. Embedding SQL is easy: use keyword “sql” followed by the statement in double-quotes. Include a single statement per sql keyword.

sql "create function change_date (d date) returns date return d - 1 day"
sql "set database event log level 0"


sql "set table fireworks read write"
sql "drop index i_fireworks_code"

table fireworks
   insert 100 rows
{
   date_lit       random(D'2009-4-12' .. D'today' step /hour/),
   bang           sql("values change_date(?)" bind date_lit),
   product_code   sequence()
}

sql "create index i_fireworks_code on fireworks(product_code)"
sql "set table fireworks read only"

// continue on with other tables ...

sql "drop function change_date"
sql "set database event log level 2"

This uses embedded SQL to

  1. Embed a function in the database that can help with inserting test data, and then remove it when DataRoller is done executing.

  2. Turn logging in the database way down so the event logs do not fill up while every row is deleted and a new set is inserted.

  3. Allow modifications to table fireworks, which is usually read-only.

  4. Remove an index and rebuild it afterward to speed up execution of DataRoller.

3.10.2 Reference External SQL File

Embed simple SQL statements directly in a DataRoller project file with the sql keyword. Collect larger SQL statements, blocks of code or larger sequences of statements in a separate “.sql” file and reference this file from within the DataRoller project file.

Each separate SQL statement is executed within an external SQL file. DataRoller does not parse the SQL syntax or otherwise understand the structure of an external SQL file. It makes rudimentary decisions about where one statement ends and another begins, and can only handle a limited set of comment placements without being confused.

Rules for assembling an external SQL file for DataRoller:

Example:

sql "insert into states (abbrev, name) values ('ca', 'California')"

table states
   insert 2 rows
{
   ...
}

sql file "cleanup.sql"

File cleanup.sql:

-- Fix any abbreviations that are lower-case
update states
set abbrev = upper(abbrev)

#
# comment
#
create index i_states_abbrev on states(abbrev);

3.11 A Note on File and Folder Names

Several Generators require the name of a file or folder to read. The examples above show a simple file-based string like “C:\data\myfile.txt”. DataRoller has an internal syntax for describing files and folders that extend this to support looking in zip files and searching the Java classpath.

Basics of DataRoller Files and Folders:

  1. If it starts with “classpath:”, everything else is searched in the Java classpath

  2. If it contains an exclamation point (“!”), the left-hand side is a zip file to open, and the right side names the thing in the zip file to retrieve.

  3. If DataRoller is expecting a folder, but a zip file is specified, DataRoller will open the zip file and consider all files in the zip archive to be the files in the folder to use.

  4. If the above do not apply, just consider the string to be a file or folder on disk.



Things in DataRoller that work this way:

Construct      Context   Examples

lookup table   File      lookup table mytbl = "prod_dump.zip!recs.xls"
                         Read recs.xls from the zip archive prod_dump.zip and use it as a
                         pseudo table in DataRoller.

folder()       Folder    folder("C:/mydata")
                         Open folder C:/mydata on disk and read in all files.

                         folder("C:/mydata.zip")
                         Open the zip file and consider its contents to be the files to
                         process.

filerow()      File      filerow("classpath:data.txt")
                         Search the classpath for file data.txt and read it in.

xpath()        File      xpath("a.zip!b/c/d.xml", "/root/child")
                         Read the file d.xml in folder b/c inside zip archive a.zip as a
                         source of XML.



When using “!” to name a zip file and an internal entry, the zip file can be an absolute or relative path. Zip files with a relative path (or no path) are rooted in the same folder as the input project file. This way, a project file can be kept with all zip and other files that it depends on. The entire directory of files can be moved around together.

3.12 Lookup Tables

The job of generating test data or demo data is rarely an exercise in filling random bits and bytes into columns. One of the simplest ways to fill a database with production-looking demo data is to go raid the production database! DataRoller has the ability to reference Excel spreadsheets as if they were tables.

For example:



Create an alias within DataRoller called “mytbl” that refers to the first worksheet in file “people.xls” which resides in ZIP file “myfile.zip”:

lookup table mytbl = "myfile.zip!people.xls"

Because of the way other parts of the system work, don’t pick table aliases that are the same as real tables in the database. The lookup definition here will make the real database table unreachable in things like column() and randomrow().



Excel files must be “.xls” format, not “.csv” or “.xlsx”. This is another way of saying the file must be in “Excel 97-2003 Workbook” format. The first worksheet will be used, and all other worksheets will be ignored. The first row of the worksheet should contain the column names, and all other rows should contain the data.

Because of a limitation in Excel internal format, DataRoller cannot easily tell the difference between floating-point non-precision real numbers, “decimal” precision numbers and integers. So, DataRoller takes the most liberal interpretation and reads in all numbers as floats. To override this behavior, suffix the header cell name with “#” and a code from the table below.

Suffix   Meaning

#D       Assumes each cell in the column is a number representing a Date or Date-time value.
#F       Default for numeric columns: uses a floating-point value for each cell in the column.
#I       Assumes each cell in the column is an integer number.
If a cell is formatted in Excel as a Text cell, any suffixes above are ignored and the cell value is taken as a string.
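Putting it together, a sketch that treats a spreadsheet as a lookup table and copies related values into a real table (the file, alias and column names are made up):

lookup table people = "proddump.xls"

table employees
   insert 500 rows
{
   $p = randomrow(people),

   emp_id       sequence(),
   first_name   $p.first_name,
   last_name    $p.last_name
}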

4 Generators

Generators come in several flavors:

Random Data: Generator that picks values randomly from a range you specify.

Structured Data: Generator creates data following a plan you describe.

Lookup Data: Data pulled from another column, file on disk, or database location.

Operators: Infix symbol that combines two Expressions into a new Expression.

Conditionals: “if then else” logic to adjust processing based on conditions.

Functions: Java code that creates new data from zero or more Generator Expressions.

User Functions: Java code written by you and invoked by the DataRoller engine.

4.1 Random Data

The hallmark of all random value Generators for basic types is “random()” with values using “..” syntax:

For obvious reasons, the first term has to be less than or equal to the second term, or the range is invalid. Values are taken at random from the range, including both the lower and upper bound values (the range is inclusive). Duplicates are likely generated. The distribution of random values is approximately linear, given a large enough number of rows to generate. For small row counts, the distribution is likely non-linear, and many valid values in the range may not be generated while others are generated more than once.

There is no version of random() that supports Boolean types; use choice(true, false) instead.

4.1.1 Random Integers

In addition to the basic “random(1..10)” notation, a step can be included to change which values are generated. “random(1..10 step 2)” means 1, 3, 5, 7, and 9 are generated: start at the low bound, and only pick values that are reachable by 2s. Note that “step 2” has nothing to do with even, or “divisible by 2”: it is something like an increment value. “random(2..10 step 2)” generates random values in the set 2, 4, 6, 8, 10.

DataRoller will never generate a value outside of the low and high bounds, even if the step doesn’t exactly land on the high bound. For example, “random(6..14 step 3)” generates 6, 9, 12 and never 15 because 15 > 14.

The step can be any positive value (zero and negative numbers are not allowed), even one larger than the entire range. The expression “random(89..226 step 10000)” generates random values starting at 89 and all values 10000 from that initial value up to value 226. In other words, it always generates the single value 89 because 89 + 10000 > 226. This has the side-effect that no value for step will ever lead to an empty set of values to pick from at random. The lower bound is always in the set of valid values.

If a random integer expression is suffixed with “unique”, it will never generate the same value twice. If the number of valid possible values is less than the number of rows to be generated for the current table, an error will result when this Generator runs out of values.

When both “step” and “unique” or “unique per parent” are included, “step” must appear first.
The following are valid:
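For instance, with an illustrative 1..100 range:

random(1..100 step 5)
random(1..100 step 5 unique)
random(1..100 step 5 unique per parent)
random(1..100 unique)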

4.1.2 Random Decimals

Random decimal values are syntactically the same as the integer ranges above, written with decimal literals, except that the step clause is required and the “unique” clauses are not supported.

The step clause is required because it is not always obvious what the implied step is: “random(1.00 .. 1.4)” could be {1.0, 1.1, 1.2, 1.3, 1.4}, or something much larger such as {1.00, 1.01, 1.02, 1.03 .. 1.38, 1.39, 1.4}.
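For instance, a sketch of a decimal range that only generates values ending in .00, .25, .50 or .75:

random(0.25 .. 20.00 step 0.25)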

4.1.3 Random Floats

Due to the continuous nature of Float values, “step” is not supported. “1.54e-106” as a step could easily lead to round-off values, resulting in strange results. This is also against the general use of floating-point, continuous numbers.

Specify random float values with a simple low … high range, such as “random(1.1e4 .. 2.2e5)”.

Neither “unique” nor “unique per parent” is supported for floating-point ranges.

4.1.4 Random Dates and Timestamps

Dates and Timestamps contain discrete values and therefore support “step” and “unique” clauses, but their syntax is different from the above. The date or Timestamp range is just a low value “..” high value. The step is a notation of which dimension is the “cut-off” for the set of valid values. I.e., “step /hours/” means generate values such that the minutes and seconds components are always zero. The forward slashes are required.

The default step is “step /days/”, meaning the time is always zero (midnight).
“random(D’1900-1-1’ .. D’1900-12-31’)” means generate Timestamps between those two dates, with hours, minutes and seconds zero.

The valid values for step are “/years/”, “/months/”, “/days/”, “/hours/”, “/minutes/” and “/seconds/”. The low and high bounds must not specify a value for any component finer than the step. For example, “step /hours/” means you cannot specify a value for minutes or seconds in the low or high bound. Since some step values (like “/years/”) run into syntax problems with the date literals, “1” can be used for month and day.

Step      Template for Low and High Bounds   Interpretation

Seconds   None – use any date, unrestricted.

Minutes   D’nnnn-nn-nn’                      Seconds must be zero when specified.
          D’nnnn-nn-nnTnn:nn’
          D’nnnn-nn-nnTnn:nn:00’

Hours     D’nnnn-nn-nn’                      Seconds and minutes must be zero when specified.
          D’nnnn-nn-nnTnn:00’
          D’nnnn-nn-nnTnn:00:00’

Days      D’nnnn-nn-nn’                      Entire time component must be zero when specified.
          D’nnnn-nn-nnT00:00’
          D’nnnn-nn-nnT00:00:00’

Months    D’nnnn-nn-1’                       Because the syntax for a date does not allow omitting
          D’nnnn-nn-1T00:00’                 the day, set the day to 1. I.e., “May 12” is not a good
          D’nnnn-nn-1T00:00:00’              bound when you only want months. The date returned
                                             always has a day-of-the-month set to 1.

Years     D’nnnn-1-1’                        Like above, use “1” for the month number. All dates
          D’nnnn-1-1T00:00’                  returned have a day+month of January 1st.
          D’nnnn-1-1T00:00:00’

Don’t forget that “today” always has a time component of all zeros. “now” is this exact point in time, and so probably has a non-zero value for hours, minutes and seconds. This means “random(D’today’..D’now’ step /minutes/)” or any step other than “/seconds/” will fail because now will have at least one non-zero time component.

Both “unique” and “unique per parent” clauses are supported and may only appear after the step clause.
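For instance, an illustrative range that hands out distinct days within one year:

random(D'2011-1-1' .. D'2011-12-31' step /days/ unique)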

4.1.5 Choice

The choice() generator will pick a value at random from the list of explicit values included directly in the declaration. Specify a list of any valid values, and choice() will pick values at random. The default distribution is linear: each value listed in choice() is picked with equal probability. Each item listed in choice() can also include a relative weight so some values will be chosen more frequently.

If no weight is included, it is assumed to be 1.

Examples:

choice( 0xff0000,                 Pick any of the three values at random, each with equal
        0x00ff00,                 probability.
        0x0000ff )

choice( "open" weight 6,          “open” chosen around 6 times more frequently than “closed”.
        "closed" )

choice( "A" weight 8,             8 + 3 = 11, so 8 times out of 11 “A” is chosen and 3 times
        "B" weight 3 )            out of 11 “B” is chosen.

choice( "Jan" weight 31,          Using this same approach, pick a month based on the
        "Feb" weight 28 )         distribution of possible days. January has 31 days in it, and
                                  February usually has 28. So, “Jan” is chosen slightly more
                                  frequently because there are more days in January than in
                                  February.

4.1.6 BLOB

Database column types accepting binary data include binary, long, long raw, blob, image and others. The primary way to put binary data into these fields is to use folder() above with a folder full of non-text files. This approach leaves the database in a state where the application can still execute. Imagine the results of putting random bits into a column that the application considers an image! The application would not function correctly.

Use the blob() Generator for creating massive amounts of random bytes to store into database columns that you don’t care about. This is a good approach for truly maxing-out a database in preparation for load testing: firing queries at a database with gigabytes of random data to chew on.

Pass the number of bytes to generate as a range to blob().

Example:

blob (1000000 .. 1000000000) // generate between 1MB and 1GB of random data

Implementation Note. DataRoller does not stream random data into the database. It creates the entire blob byte array in memory first, and then sends it to the database. This means the complete generated BLOB value exists in memory. Asking for a BLOB of size 30 GB will likely crash your Java VM without special consideration.

4.2 Structured Data

4.2.1 Lorem Ipsum

Lorem Ipsum refers to generated nonsense words that are organized in a way that resembles a natural western language without diacritical characters. Sentences start with capital letters and end with periods. Sentence lengths tend to follow general western word counts. Lorem Ipsum is used to populate areas that should hold prose without being distracting to the eyes: the text does not intrigue the “reader” or pull their gaze from the overall appearance of the screen.

Sample Lorem Ipsum text:

“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut mauris augue, fringilla at lacinia vitae, convallis in est. Sed molestie dictum sem, eget pulvinar est fermentum sed. Vestibulum eros ipsum, interdum id pulvinar et, vehicula ac dolor. Integer viverra, libero vitae egestas scelerisque, odio lorem dictum lectus, id egestas nibh felis mollis augue.”

Use “lorem()” to generate random strings to populate descriptive database columns. Pass a range to use for the generated lengths, such as “lorem(50..900)” to generate a string of between 50 and 900 characters that has sentences of random-looking text.
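For example, a minimal sketch populating a hypothetical description column:

table products
   insert 100 rows
{
   prod_id       sequence(),
   description   lorem(50 .. 900)    // 50 to 900 characters of sentence-like text
}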

This Generator does not just spit out the same text for each invocation; each generated value is different random-looking prose with its own sentence lengths and word choices.

4.2.2Sequences

Sequences return a different value for each row, each value one step greater than the last. For numbers this is straightforward. For Dates and Timestamps, an increment Time Duration provides the same effect.

Integer Sequences like “sequence(100 step 10)” start at 100 and count up with each subsequent value 10 larger than the last. The step is optional and defaults to 1. If both the starting value and step are omitted (“sequence()”), a starting value of 1 and step of 1 are used.

The step value may be positive or negative, but not zero.

Sequences do not “wrap” or stop at zero. “sequence(3 step -1)” returns {3, 2, 1, 0, -1, -2, -3, …}.

Decimal and Float Sequences must always supply a step.
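For example, sketches of a decimal sequence and a float sequence with explicit steps, written as they would appear inside a table block (column names are hypothetical; float literals use the e-notation used elsewhere in this guide):

   unit_price   sequence(0.25 step 0.25),     // 0.25, 0.50, 0.75, 1.00, ...
   weight       sequence(1.5e0 step 0.1e0)    // 1.5, 1.6, 1.7, ...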

Because sequences generate a new number each invocation without wrapping, they are implicitly unique. The “unique” clause is not supported as it would be redundant.

The optional “unique per parent” clause is supported. A sequence with “unique per parent” will reset back to the starting value for each new parent row created.
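For example, a line-number column that restarts at 1 for every parent row might look like this sketch (table and column names are hypothetical):

table order_lines
   child of orders on . . .
   insert 1..5 rows
{
   line_no   sequence(1 unique per parent),   // 1, 2, 3, ... restarting for each order
   qty       random(1..10)
}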

4.2.2.1Date Sequences

Date sequences must supply a starting Date or Timestamp and may optionally include a step. Steps specify a Time Duration that is added to each previous value to compute the next value. Quick example of a sequence with a Time Duration:

sequence(D’1920-7-19’ step /7 days 6 hours/)

When omitted, the default Time Duration is “/1 day/”.

A Time Duration can specify a value for years, months, days, hours, minutes or seconds, or any combination of the above. Multiple entries must appear from coarsest to finest: years before anything else, days before hours, and so on. “/1 month/” applied to D’2000-1-1’ will yield D’2000-2-1’, D’2000-3-1’, D’2000-4-1’, D’2000-5-1’, and so on. Notice that the day component of the date remains the first day of each month! A month is not always the exact same absolute duration; for DataRoller, “/1 month/” means move forward to the same day of the next month. The same concept applies to years (a year is not always 365 days).

When incrementing by “/1 year/”, for example, there is no requirement that all finer components be zero. So “sequence(D’1994-2-14’ step /1 year/)” will return Valentine’s day in each subsequent year.

This works nicely for things like “quarterly”: “sequence(D’2003-1-1’ step /3 months/)” returns January 1st, then April 1st, then July 1st, then October 1st of each year (the first day of each quarter).



A fully-specified Time Duration:

/ 1 year -2 months 3 days -4 hours 5 minutes -6 seconds/

To compute the next value in a sequence, the previous Timestamp is moved by each component of the Time Duration. That is, the previous value is moved forward 1 year, then back 2 months, then ahead 3 days, and so on.

The entries in a Time Duration must not cancel each other out. For example, the following are all invalid Time Durations because they have no net effect on the previous value when computing the next value:

/1 year -12 months/
   The two components always cancel each other out.

/1 month -30 days/
   Only works when you cross from one month to another that has 28, 29 or 31 days. Causes an error for April 1st: 4/1/2000 + 1 month = 5/1/2000 - 30 days = 4/1/2000. This would always generate the same date!

/-1 day 24 hours/
   The two components always cancel each other out.

/1 month -29 days/
   Generates values successfully until it encounters a leap year, where it fails when it crosses February.

/-1 year 365 days/
   Fails on all common years (non-leap years).

/1 hour -59 minutes -60 seconds/
   Always fails: the three values always add up to a net of zero. Hours, minutes and seconds always relate in a 1:60:60 ratio, so there are no interesting cancellations like the leap-year case above.



Note that the order in which components are applied is significant in a few miserable cases:

Specification                  sequence(D’2010-6-1’ step      sequence(D’2010-6-1’ step
                               /1 month -30 days/)            /-1 month 30 days/)

Start with the first date      2010-6-1                       2010-6-1

Apply the month first          2010-7-1                       2010-5-1

Now apply the days             2010-6-1                       2010-5-31

Result                         Same as start! Error!          Not the same, no error

4.3Column

The column() Generator references a value in another column, optionally in another database table or lookup table. Column() is really a family of similar generators that all share a common syntax.

Column generators:

column(gate_num) or just gate_num

   Current-Row Reference. Evaluates to the value in the current table, in the current row that was just generated. The column() wrapper can be omitted.

column(gate_num.level) or just gate_num.level

   Separate Table Reference. Evaluates to a random value from the column in a separate table or external file (lookup table). The column() wrapper can be omitted.

column(parent.level) or just parent.level

   Parent Table Reference. Evaluates to the column in the specific row in the parent table that the current row belongs to. This is not the same as the Separate Table Reference above! See below for details. The column() wrapper can be omitted.
As a shortcut because the above column references are so prevalent in DataRoller files, the “column()” syntax can be omitted as long as “unique” is not present. The column() syntax is required when the “unique” or “unique per parent” feature is used.

table products
   child of names on . . .
   insert 10 rows
{
   order_id      sequence(),
   inv_type      pattern(“U”),
   invoice_num   “INV-” + inv_type,     // inv_type references col above
   prod_id       apps.id,               // id col in table apps
   customer      parent.last_name       // last_name column from names table
}

A column name or table name that clashes with a DataRoller reserved keyword can be escaped with square brackets:

table [sequence]
   child of ...
   insert 10 rows
{
   [bind]     sequence(),
   key        “A” + format([bind], “000”),
   customer   parent.[folder]
}

In Diagram Form

Table names refers to an external Excel sheet.

parent.brkcode references column brkcode in the single parent row in table brackets for every row in orders.

inv_type by itself refers to the inv_type column in the current row being generated.

products.id refers to a random value for column id in table products.

names.last_name refers to a random value from Excel sheet names (see lookup table at top) column last_name.

4.3.1Current-Row Reference

The simplest invocation of column() names another column in the table currently being processed. This simply copies the value from that column in the current row being assembled by DataRoller; the value already exists in DataRoller and will soon be in the database once DataRoller has assembled all values for the current row. Because of this, do not name a column that appears further down in the current table in the DataRoller project file: the values for subsequent columns have not yet been generated at the time column() is processed. Obviously, column() cannot refer to itself.
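A minimal sketch of a Current-Row Reference (table and column names are hypothetical): barcode copies the badge_num value just generated for the same row.

table badges
   insert 25 rows
{
   badge_num   sequence(),
   barcode     "B-" + format(badge_num, "00000")   // current-row reference to badge_num
}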

Neither “unique” nor “unique per parent” is supported, since a Current-Row Reference applies only to the current row about to be inserted.

4.3.2Separate Table Reference

To reference a column in another table, invoke column() passing both the table and target column with a “.” separator. For example, column(products.name) refers to column name in the products table. When invoking column() with a table and column name, column() picks a random row from the target table and uses that column’s value.

The table name can be the alias of a lookup table defined at the top of the DataRoller project file, or a real table in the database. The first time the generator is invoked, the entire target table is read into memory, and this in-memory copy is used for the rest of the DataRoller project run. This means referencing very large Excel files (lookup tables) or database tables with a large number of rows or large columns (LOBs) may lead to memory problems.

Because tables are completely read into memory upon first need, self-referencing tables will not work the way you might think. Consider this:

table employees
   delete all
   insert 100 rows
{
   emp_id   sequence(),
   name     filerow(“names.txt”),
   boss_id  column(employees.emp_id)   // doesn’t work the way you want this to!
}

This is an attempt to build a random hierarchy by putting a previously-generated primary key into the boss_id field. While building the first row to be inserted into the database, DataRoller will cache the entire employees table in memory, which currently has zero rows (note the “delete all” above!) and so will cause an error.

The sql() generator might be considered as a possible solution too:

table employees
   delete all
   insert 100 rows
{
   emp_id      sequence(),
   name        filerow(“names.txt”),

               // probably not what you want:
   boss_id     sql(“select emp_id  from employees  order by RAND()  limit 1”)
}

This too is problematic. DataRoller batches up insert statements and sends them to the database in groups, and the sql() call above only considers rows already in the database. DataRoller will create the first fifty rows, all of which have null for boss_id, since no data has been written to the database yet. Only when DataRoller creates row number fifty-one will there be fifty rows in the table for consideration.

Both “unique” and “unique per parent” are supported. When using one of these clauses, the column() syntax must be used. “column(model_types.name unique)” is correct; just “model_types.name unique” is a syntax error.

4.3.3Parent Table Reference

A table definition that has a “child of” clause may use “column(parent.foo)” where foo is the name of a column in the table listed in the “child of” clause. “parent” is a DataRoller keyword and is required.

Any column in the parent table may be used, even columns not present in the DataRoller project file! DataRoller will execute a separate SQL select statement to retrieve a value for every row in the child table. This is significant if the parent table has triggers that create or alter data in the database table, or if there is a SQL statement after the parent table in the DataRoller project file that alters column values.

Consider the following example:

table accounts
   insert 100 rows
{
   username           pattern(“L”) + filerow(“names.txt”),
   password_plain     pattern( “L” * random(8..25) )
}


sql “update accounts set password_cipher = PASSWORD(password_plain)”


table login_probs
   child of accounts on this.username = parent.username
   insert 0..1 rows
{
   expected_passwd   parent.password_cipher,
   actual_passwd     pattern(  choice(“L”,”U”,”N”) * random(8..10)  )
}

The accounts.password_cipher column is computed by calling a database-resident function PASSWORD() on the password_plain column (maybe because DataRoller does not have this ability). Table login_probs references this via the parent.password_cipher reference, even though column password_cipher does not appear as a column in the DataRoller file above. DataRoller simply issues a SQL select similar to the following:

select password_cipher from accounts where username = ?



Using the parent keyword is not the same as just using the actual table name of the parent table. “parent.foo” is nothing at all like “names.foo” even when the current table has clause “child of names”. The parent keyword means that DataRoller will follow the foreign key in the child table to the parent key in the parent table, and select the column in question. Using the actual table name of the parent table will select a value from a random row in the parent table, regardless of the foreign key in the child table.

In the example sketched below, parent.person and pledges.person appear to reference the same thing in parent table pledges. They do both name the same column in the same table, but the “parent” keyword has different semantics from just using the name of the table.

parent.person copies the value from the parent table, and then initials() converts the full name to just the two or three starting letters.

pledges.person just picks a random value from table pledges.

Also note that “initials” is the name of a function, but function names are not DataRoller keywords, so the column named “initials” does not need to be escaped as “[initials]”.
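A minimal sketch reconstructing the example described above; the child table name, its row count and the witness column are hypothetical:

table donations
   child of pledges on . . .
   insert 1..3 rows
{
   initials   initials(parent.person),   // person from this child row's own parent in pledges
   witness    pledges.person             // person from a random row anywhere in pledges
}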

DataRoller processes each table in the project file to completion before starting the next table in the project file. This rule also applies to tables with “child of” clauses: DataRoller completes all processing for the parent table first, and then continues with each child table separately. This is important if you directly reference a parent table in a child table via the parent table name (the “pledges.person” above instead of “parent.person”): a reference like “pledges.person” has every row of table pledges available to it.

A table in a DataRoller project file may have more than one child table in the project file.

Neither the “unique” nor the “unique per parent” clause is supported for Parent Table References, since they directly reference a single row in the parent table.

4.4Lookup Data

4.4.1File Row Lookup

Filerow() will read in the entire contents of the file passed to it, and pick rows at random to insert into the database. The file specified must be a text file with typical line breaks (UNIX or DOS format).

Filerow() is typically used with a large file of “boilerplate” text or common words and phrases like last names, cities, countries and so on.

Both “unique” and “unique per parent” clauses are supported for filerow().
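For example, a sketch using two hypothetical text files, each with one word or phrase per line:

table customers
   insert 500 rows
{
   cust_id     sequence(),
   last_name   filerow("names.txt"),    // random line from names.txt
   city        filerow("cities.txt")    // random line from cities.txt
}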

4.4.2Folder Contents Lookup

The folder() Generator will retrieve the contents of files on disk and use them as values to insert into database columns. This is a good choice for columns that store images, Microsoft Word or Excel documents, large XML strings, or other structured data that you might have on hand. The Generators above are designed to create data from nothing based on a structure you put into the DataRoller Project file. Folder Contents Lookup and other Generators below are designed to use content you already have on hand.

Pass the name of the folder on disk where target files are kept. The folder() generator will pick a file at random for every row inserted. The value inserted into the database is the entire contents of the file (not the name of the file in question).

For files not ending with “.txt”, data is retrieved from the target files without any byte manipulation, line-termination conversion or national character set changes. Files ending with “.txt” are treated differently: they are read in as strings and inserted into the database as strings (with line-termination characters intact).

Use a folder containing only “.txt” files to populate columns of type clob, long varchar and such.

Use a folder containing no “.txt” files for blob, raw, long raw, binary and such.



C:\data\images

monalisa.jpg
starrynight.jpg
waterlilies.jpg

C:\data\pamphlets

monalisa.docx
starrynight.docx
waterlilies.docx

C:\data\specs

monalisa.xml
starrynight.xml
waterlilies.xml

create table christies_inventory (
   forsale_id    number(8) primary key,
   picture       image     not null,
   pamphlet      long raw  not null,
   spec_xml      clob      not null
)

table christies_inventory
   delete all
   insert 3 rows
{
   forsale_id    sequence(),
   picture       folder(“C:/data/images”),
   pamphlet      folder(“C:/data/pamphlets”),
   spec_xml      folder(“C:/data/specs”)
}



This is a simple example to illustrate use. The table includes values for all three folders on disk, but “folder()” as used above does not correlate anything: a single row might have monalisa.jpg, starrynight.docx and waterlilies.xml in it. Folder() chooses values at random, so some values may be used more than once and others not used at all.

Folder() supports both “unique” and “unique per parent” clauses.

4.4.3XML File Lookup

Like filerow() above, xpath() will read in a file and serve out its contents at random. xpath() reads in an XML file and retrieves strings based on a supplied XPath expression. The XPath expression must match text directly, or nodes that have text values (such as “<name>Malcolm Reynolds</name>”).

Consider the following XML snippet passed to xpath():

<bill session="107" type="sj" number="23">
  <titles>
      <title type="popular">Go get em!</title>
      <title type="official">Military Force Authorization</title>
  </titles>
  <sponsor id="300031" />
  <actions>
    <vote date="1000440000" how="roll" roll="281" where="s" />
    <vote date="1000520280" how="without objection" where="h" />
    <enacted law="107-40" date="1000785600"/>
  </actions>
  <subjects>
    <term name="Defense policy"/>
    <term name="Air piracy"/>
  </subjects>
  <summary>Authorizes the President to go get em.</summary>
</bill>



Expected behavior for various XPath expressions:

XPath              Sample Matched Value                     xpath() value inserted into database

/bill/@session     107                                      107

titles             (nearly everything)                      Error: must match text or a text node.

/bill/summary      <summary>Authorizes the President        Authorizes the President to go get em.
                   to go get em.</summary>

term/@name         "Defense policy" and "Air piracy"        "Defense policy" or "Air piracy"

/bill/sponsor      <sponsor id="300031" />                  Error: no text under this XML element.



Both “unique” and “unique per parent” clauses are supported for xpath().
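A sketch of what a call against the file above might look like; the two-argument form shown here (file name first, then XPath expression) and the table and column names are assumptions for illustration only:

table bill_summaries
   insert 20 rows
{
   bill_id   sequence(),
   summary   xpath("bills.xml", "/bill/summary")   // assumed argument order
}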

4.4.4randomrow()

Randomrow() will pick a row at random from the named table (or lookup table) and return a DataRoller-internal structure to hold the value for each column. This returned value is useful only with the “$foo.bar” variable syntax. This is typically used when multiple values from another table need to be included and they all need to be from the same row. Use column() to pick multiple values from another table when the values do not need to be from the same row in the other table.

Using randomrow() assures all the values come from the same row, whereas separate column references each pick random values independently.
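A sketch of the difference (table and column names are hypothetical): the $cust columns always match because they come from one customers row, while the bill_* columns each pick their own random row and so may not match.

table orders
   insert 100 rows
{
   $cust = randomrow(customers),

   ship_city   $cust.city,        // city and zip always come from the
   ship_zip    $cust.zip,         // same customers row

   bill_city   customers.city,    // each of these picks its own random row,
   bill_zip    customers.zip      // so this city and zip may not match
}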



The table named in randomrow() can be the alias of a lookup table defined at the top of the DataRoller file or the name of an actual database table. Because SQL has no built-in, database-independent way to efficiently pick a random row from a table, and because multiple tables in the project may need to access the same table, the table named in randomrow() is read completely into DataRoller memory. Random rows are served from this in-memory cache of the table. As with column() above, this means that using randomrow() to refer to a table that is undergoing changes via DataRoller may lead to problems: the table is cached on first need, and if DataRoller subsequently changes the in-database table, the in-memory cache is not updated. Using randomrow() with a very large table, or a table containing large values in LOB columns, may lead to memory exhaustion.



Both “unique” and “unique per parent” clauses are supported for randomrow(). Each occurrence of randomrow() with “unique” or “unique per parent” selects its rows independently of the others, so adding a “unique” clause does not guarantee that two randomrow() occurrences are mutually unique.

Put another way, the following does not guarantee that a single row in table race won’t have the same values for $start and $end. Variables $start and $end may both point to the same row in location, both coincidentally picked at random.

table race
   insert 50 rows
{
   $start = randomrow(location unique),
   $end   = randomrow(location unique),

   start_lat   $start.latitude,
   start_lng   $start.longitude,


   end_lat     $end.latitude,
   end_lng     $end.longitude
}

4.4.5Previous Row

DataRoller maintains the last row it inserted into the current table so its values can be used while generating the subsequent row. There are two functions to accomplish this: haspreviousrow() returns false while the first row is being generated (no previous row exists yet) and true for every row after that, and previousrow(column) returns the named column’s value from the previously inserted row.

As a best practice, use an “if” statement to separate what DataRoller should do on the first row it inserts, and for every other row it inserts:

table race_findings
   insert 3 rows
{
   place    sequence(), // 1st place, then 2nd place, ...
   person   filerow(“names.txt”),
   points   if (haspreviousrow())
                previousrow(points) - random(1..5) // runners up below by 1 to 5
            else
                random(100 .. 120) // first place scored 100+ points!
            end
}

In the above example, a table holds the points scored by the top 3 contestants. To make the scores realistic, the first place holder has the most points, and each runner-up has between 1 and 5 fewer points. By using data keyed off of the previous row inserted, this assures that the third place winner doesn’t have more points than the second or first place winners. Each value is computed to be a little less than the previous row DataRoller inserted.

DataRoller always resets the previous row to empty when it is about to insert the first row into any table. Put another way, the data from the last row of the previous table is not available to the first row of the next table in the DataRoller script file!

The real power of previousrow() is how it operates on a table with a “child of” clause: DataRoller resets the previous row every time the parent row changes. Function haspreviousrow() is false exactly once for every unique row in the parent table.

To extend the example above, consider race_findings with a parent race_schedule table that contains a record for every race held. Function haspreviousrow() is false for the first row inserted into race_findings for each record in race_schedule.

table race_schedule
   insert 50 rows
{
   race_id  sequence()
}

table race_findings
   child of race_schedule on this.race_id = parent.race_id
   insert 3 rows
{
   place    sequence(1 unique per parent), // 1st place, then 2nd place, ...
   person   filerow(“names.txt”),
   points   if (haspreviousrow())
                previousrow(points) - random(1..5) // runners up below by 1 to 5
            else
                random(100 .. 120) // first place scored 100+ points!
            end
}

That’s it! Function haspreviousrow() is false once for each row in race_schedule because DataRoller resets the previous row internally to empty whenever it moves to another parent record.

The table below contains data from the above DataRoller script. DataRoller resets the previous row to empty each time it begins a new race in table race_findings.

race_id   place   person    points

1         1       Chloe     101      previous row reset: first row
1         2       Mary       99
1         3       Abdul      96
2         1       Chris     115      previous row reset: new parent
2         2       Arlene    111
2         3       Greg       98
3         1       Nate      105      previous row reset: new parent
3         2       George    100
3         3       Simon      94




A great use for the DataRoller previous row feature is generating routing data for use in geospatial applications. Consider a data set of tagged elk that wander around the general Calgary area. The data set has a row for each tagged elk in a parent table and tracks each elk’s position via a GPS signal in its collar tag. The position is recorded each day for a two-week period, and the results are charted on a map.

table elk_routes
   child of elk on this.elk_id = parent.elk_id
   insert 14 rows
{
   route_seq   sequence(1 unique per parent),

   latitude    if (!haspreviousrow())
                   random(  50.0e0 ..   51.46e0)
               else
                   previousrow(latitude) + random(-0.06e0 .. 0.06e0)
               end,

   longitude   if (!haspreviousrow())
                   random(-116.2e0  .. -112.11e0)
               else
                   previousrow(longitude) + random(-0.06e0 .. 0.06e0)
               end
}

Column route_seq holds numbers 1 through 14 for each tagged elk (parent table row); this is the tracking day. The latitude and longitude for the first row of each elk are computed randomly near Calgary in the first branch of the “if” above, since haspreviousrow() returns false for the first row of each parent row. For each subsequent day, DataRoller picks a latitude and longitude that is up to 0.06 degrees away from the place the elk was the day before (the previous row).

“previousrow(longitude) + random(-0.06e0 .. 0.06e0)” means take the previous row’s longitude value and move up to 0.06 degrees west or east.

If you plot the elk_routes table as lines on Google Maps, you would see a meandering route for each elk, wandering around the Calgary area.