CSV Splitter
The CSV Splitter provided by IPF splits a CSV file into separate components for use in debulking.
It accepts only a restricted component hierarchy configuration, which defines the structure of the CSV file.
Maven Dependency
To use the CSV splitter, add the following dependency. Its version must match ipf-debulker-core to ensure compatibility.
<dependency>
    <groupId>com.iconsolutions.ipf.debulk</groupId>
    <artifactId>ipf-debulker-csv-splitter</artifactId>
    <version>${ipf-debulker-core.version}</version>
</dependency>
Component Hierarchy Configuration
Unlike the XML and JSON splitters, the CSV splitter supports only one hierarchy shape: a root component with a single child. This is because CSV typically lends itself to flat data structures.
When configuring the root component, the marker must be either "header" or "none". Selecting "header" assigns the content of the first line of the file to the root component, while choosing "none" leaves the root component with null content.
The root may have only one child component, and the only marker allowed for it is "body". If the root is set to "header", the child components contain the lines that follow the first line, one line per component. If the root is set to "none", the child components include the first line as well, which keeps the splitter compatible with CSV files that lack a header.
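The effect of the two root markers can be illustrated with a small standalone sketch. It does not use the IPF API: MarkerSemanticsSketch, SplitResult and split are hypothetical names introduced here only to mimic the rules described above, and the handling of an empty file is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerSemanticsSketch {

    /** Hypothetical result holder for this illustration; not an IPF type. */
    public static final class SplitResult {
        public final String rootContent;     // first line for "header", null for "none"
        public final List<String> bodyLines; // one entry per "body" component

        SplitResult(String rootContent, List<String> bodyLines) {
            this.rootContent = rootContent;
            this.bodyLines = bodyLines;
        }
    }

    /** Mimics the documented marker rules on an in-memory list of lines. */
    public static SplitResult split(List<String> lines, String rootMarker) {
        switch (rootMarker) {
            case "header":
                if (lines.isEmpty()) {
                    // Assumption: an empty file yields no root content and no bodies.
                    return new SplitResult(null, new ArrayList<>());
                }
                // The first line becomes the root content; the rest become bodies.
                return new SplitResult(lines.get(0),
                        new ArrayList<>(lines.subList(1, lines.size())));
            case "none":
                // No header: the root content is null and every line is a body.
                return new SplitResult(null, new ArrayList<>(lines));
            default:
                throw new IllegalArgumentException(
                        "Root marker must be \"header\" or \"none\": " + rootMarker);
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Author,Title",
                "\"Martin, Robert\",Clean Code",
                "\"Bloch, Joshua\",Effective Java");

        SplitResult withHeader = split(lines, "header");
        SplitResult withoutHeader = split(lines, "none");

        System.out.println("header: root=" + withHeader.rootContent
                + ", bodies=" + withHeader.bodyLines.size());
        System.out.println("none: root=" + withoutHeader.rootContent
                + ", bodies=" + withoutHeader.bodyLines.size());
    }
}
```

With "header" the three input lines produce one root content and two bodies; with "none" they produce a null root content and three bodies.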
Usage Example
Imagine we want to process potentially large CSV files containing data about books in a library and split them into individual records for use by a downstream system.
The example file is small for demonstration purposes, but a real file could contain a large number of lines.
example.csv
Library,Author,Title,Chapter,Start Page
Library of Alexandria,"Martin, Robert",Clean Code,Clean Code,1
Library of Alexandria,"Martin, Robert",Clean Code,Meaningful Names,17
Library of Alexandria,"Bloch, Joshua",Effective Java,Introduction,1
Library of Alexandria,"Bloch, Joshua",Effective Java,Creating and Destroying Objects,5
Given our example CSV data, we might decide to split it using the following hierarchy.
header
└── body
With these prerequisites out of the way, let's write the program. Bear in mind that this example uses Project Reactor to convert the Java 9 Flow.Publisher to a Flux, which makes subscribing to the data a bit simpler. Reactor could be replaced with any other reactive library that is compatible with the Java 9 Flow API, e.g. RxJava.
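// Actor system required by the Akka-based splitter
ActorSystem system = ActorSystem.create();
Splitter splitter = new AkkaCsvSplitter(system);
InputStream stream = getClass().getClassLoader().getResourceAsStream("example.csv");
// The root uses the "header" marker; its single child must use "body"
ComponentHierarchy root = ComponentHierarchy.root("header");
root.addChild("body");
// Assuming split(...) returns a Flow.Publisher as described above, adapt it
// to a Flux with Reactor's reactor.adapter.JdkFlowAdapter
Flux<DebulkComponent> flux = JdkFlowAdapter.flowPublisherToFlux(splitter.split(stream, root));
// Blocking keeps the demo simple; prefer subscribing in real code
List<DebulkComponent> components = flux.collectList().block();
components.forEach(System.out::println);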
Running this code should print a series of extracted components to the console.
output
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Martin, Robert",Clean Code,Clean Code,1)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Martin, Robert",Clean Code,Meaningful Names,17)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Bloch, Joshua",Effective Java,Introduction,1)
DebulkComponent(bulkId=null, id=, parentId=, marker=body, index=null, content=Library of Alexandria,"Bloch, Joshua",Effective Java,Creating and Destroying Objects,5)
DebulkComponent(bulkId=null, id=, parentId=null, marker=header, index=null, content=Library,Author,Title,Chapter,Start Page)
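Each body component carries the raw CSV line as its content, so a downstream consumer still has to separate the fields, and a plain split on commas would break quoted values such as "Martin, Robert". The following is a minimal sketch of a quote-aware field parser for a single line. CsvLineFieldsSketch and parseLine are hypothetical names that are not part of IPF, and a dedicated CSV library would normally be a better choice for real workloads.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineFieldsSketch {

    /** Splits one CSV line into fields, honouring double quotes and "" escapes. */
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are literal while quoted
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Content of one body component from the output above
        String content =
                "Library of Alexandria,\"Martin, Robert\",Clean Code,Meaningful Names,17";
        List<String> fields = parseLine(content);
        System.out.println("Author: " + fields.get(1));
        System.out.println("Start page: " + fields.get(4));
    }
}
```

The header component's content can be passed through the same method to obtain the column names that correspond to each field position.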