CSV Splitter

The CSV Splitter provided by IPF splits a CSV file into separate components for use in debulking.

It takes a fairly restricted component hierarchy configuration to define the structure of the CSV file.

Maven Dependency

To use the CSV splitter, the following dependency must be provided, with a version matching ipf-debulker-core to ensure compatibility.

<dependency>
    <groupId>com.iconsolutions.ipf.debulk</groupId>
    <artifactId>ipf-debulker-csv-splitter</artifactId>
    <version>${ipf-debulker-core.version}</version>
</dependency>

Component Hierarchy Configuration

Unlike the XML and JSON splitters, the CSV splitter only supports a two-level hierarchy: a root component with a single child component. If the input CSV structure has further levels of nesting (e.g. based on a record type field), consider writing a custom splitter that keeps the current record type as part of its state.
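The idea of keeping the current record type as state can be sketched in plain Java. This is only an illustration and does not implement the IPF Splitter interface: the class RecordTypeSplitterSketch, the Node holder, the markers "batch" and "transaction", and the record type codes "B" and "T" are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordTypeSplitterSketch {

    /** Hypothetical component holder; not the IPF DebulkComponent. */
    static final class Node {
        final String id;
        final String parentId; // null for top-level records
        final String marker;
        final String content;

        Node(String id, String parentId, String marker, String content) {
            this.id = id;
            this.parentId = parentId;
            this.marker = marker;
            this.content = content;
        }

        @Override
        public String toString() {
            return marker + "(id=" + id + ", parentId=" + parentId + ", content=" + content + ")";
        }
    }

    /** Assigns each "T" line to the most recently seen "B" line. */
    static List<Node> split(List<String> lines) {
        List<Node> out = new ArrayList<>();
        String currentBatchId = null; // state carried from one line to the next
        int nextId = 1;
        for (String line : lines) {
            // Assumption: the first field of every line is the record type.
            String recordType = line.split(",", 2)[0];
            String id = String.valueOf(nextId++);
            switch (recordType) {
                case "B":
                    currentBatchId = id;
                    out.add(new Node(id, null, "batch", line));
                    break;
                case "T":
                    if (currentBatchId == null) {
                        throw new IllegalStateException("Transaction record before any batch record: " + line);
                    }
                    out.add(new Node(id, currentBatchId, "transaction", line));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown record type: " + recordType);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("B,batch-1", "T,10.00", "T,12.50", "B,batch-2", "T,7.25");
        split(lines).forEach(System.out::println);
    }
}
```

A real custom splitter would read the input as a stream rather than a list of lines, but the state handling would follow the same pattern.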

When configuring the root component, the marker must be either "header" or "none". Selecting "header" assigns the content of the first line of the file to the root component, while choosing "none" leaves the root component with null content.

The only allowed marker for the child component (of which only one is allowed) is "body". If the root is set to "header", each child component contains one of the subsequent lines of the file. If the root is set to "none", the first line is also emitted as a child component, ensuring compatibility with CSV files that lack a header.
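The marker rules above can be illustrated with a small plain-Java sketch. It does not use the IPF API: the class MarkerRulesSketch, the Component holder and the split method are hypothetical names, and emitting the root component last is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerRulesSketch {

    /** Hypothetical holder for illustration; not the IPF DebulkComponent. */
    static final class Component {
        final String marker;
        final String content;

        Component(String marker, String content) {
            this.marker = marker;
            this.content = content;
        }

        @Override
        public String toString() {
            return "Component(marker=" + marker + ", content=" + content + ")";
        }
    }

    /** Applies the "header" / "none" rules to a list of CSV lines. */
    static List<Component> split(List<String> lines, String rootMarker) {
        if (!rootMarker.equals("header") && !rootMarker.equals("none")) {
            throw new IllegalArgumentException("Root marker must be \"header\" or \"none\"");
        }
        boolean hasHeader = rootMarker.equals("header") && !lines.isEmpty();
        List<Component> result = new ArrayList<>();
        // With "header" the body starts at the second line; with "none" at the first.
        for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
            result.add(new Component("body", lines.get(i)));
        }
        // The root carries the first line for "header" and null content for "none".
        result.add(new Component(rootMarker, hasHeader ? lines.get(0) : null));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Author,Title", "\"Bloch, Joshua\",Effective Java");
        split(lines, "header").forEach(System.out::println);
        split(lines, "none").forEach(System.out::println);
    }
}
```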

Usage Example

Imagine we want to process potentially large CSV files containing data about books in a library and split them into individual records, so they can be used by some downstream system.

The example file is small for demonstration purposes, but it could contain a large number of lines.

example.csv

Library,Author,Title,Chapter,Start Page
Library of Alexandria,"Martin, Robert",Clean Code,Clean Code,1
Library of Alexandria,"Martin, Robert",Clean Code,Meaningful Names,17
Library of Alexandria,"Bloch, Joshua",Effective Java,Introduction,1
Library of Alexandria,"Bloch, Joshua",Effective Java,Creating and Destroying Objects,5

Given our example CSV data, we might decide to split it using the following hierarchy.

header
└── body

With these prerequisites out of the way, let’s write the program. Note that this example uses Project Reactor to convert the Java 9 Flow.Publisher returned by the splitter into a Flux, which makes subscribing to the data a bit simpler. Any other reactive library that interoperates with the Java 9 Flow API, e.g. RxJava, could be used instead.

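// The splitter runs on an Akka actor system
ActorSystem system = ActorSystem.create();
Splitter splitter = new AkkaCsvSplitter(system);

// Load the example file from the classpath
InputStream stream = getClass().getClassLoader().getResourceAsStream("example.csv");

// Root marker "header" assigns the first line to the root; the single child must be "body"
ComponentHierarchy root = ComponentHierarchy.root("header");
ComponentHierarchy body = root.addChild("body");

// Assuming split returns a Flow.Publisher as described above, adapt it to a Flux
// using reactor.adapter.JdkFlowAdapter from reactor-core
Flux<DebulkComponent> flux = JdkFlowAdapter.flowPublisherToFlux(splitter.split(stream, root));

// Blocking is acceptable for a demo; it collects every component in memory
List<DebulkComponent> components = flux.collectList().block();

components.forEach(System.out::println);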

Running this code should print a series of extracted components to the console.

output

DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Martin, Robert",Clean Code,Clean Code,1)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Martin, Robert",Clean Code,Meaningful Names,17)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Bloch, Joshua",Effective Java,Introduction,1)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Bloch, Joshua",Effective Java,Creating and Destroying Objects,5)
DebulkComponent(bulkId=null, id=, parentId=null, marker=header, index=null, content=Library,Author,Title,Chapter,Start Page)
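As the output shows, the content of each body component is still a raw CSV line, and some fields (such as "Martin, Robert") contain commas inside quotes. A downstream consumer therefore cannot simply split on commas. The following is a minimal sketch of a quote-aware field parser, assuming double-quoted fields with "" as the escaped quote. The class CsvLineSketch and the method parseLine are hypothetical and not part of IPF; a production system would more likely use an established CSV library.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {

    /** Splits one CSV line into fields, honouring double-quoted fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are kept
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String content = "Library of Alexandria,\"Martin, Robert\",Clean Code,Meaningful Names,17";
        parseLine(content).forEach(System.out::println);
    }
}
```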