File Poller Scheduled Job Execution Handler

The job execution handler is the component that executes the jobs scheduled by the file poller scheduler. It is responsible for:

  • Retrieving files from the configured directory using the specified filename patterns

  • Determining each file’s processed status (new, changed since the last poll, or already processed) and scheduling jobs for new files and, depending on configuration, changed files

  • Executing each scheduled job using the execution logic defined in the FilePollerAdapter implementation

The processed status of a file is determined by hashing its content and metadata after retrieval and comparing the resulting hashes against the values previously stored in the database for the same file path. For new or changed files, the updated hash information is stored in the database at the end of the job execution. The possible outcomes are summarized in the table below.

| File Path Already Exists in the Database? | Content and Metadata Changed? | File Processed Status | Action |
| --- | --- | --- | --- |
| No | N/A | NEW | Schedule a new job for the file |
| Yes | Yes | CHANGED | Re-schedule the current job for the file based on the configured re-schedule policy (see the Changed File Job Re-schedule Policy section below) |
| Yes | No | PROCESSED | Ignore the file, as it has been processed previously |
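
The decision logic above can be pictured with the following minimal sketch. The ProcessedStatus values come from the table, but StoredHashes, FileHashRepository, and ProcessedStatusResolver are hypothetical names used only for illustration; they are not part of the actual API.

```java
import java.util.Optional;

enum ProcessedStatus { NEW, CHANGED, PROCESSED }

// Hypothetical record holding the hashes previously stored for a file path.
record StoredHashes(String contentHash, String metadataHash) {}

// Hypothetical lookup of the stored hashes for a file path, if an entry exists.
interface FileHashRepository {
    Optional<StoredHashes> findByFilePath(String filePath);
}

class ProcessedStatusResolver {

    private final FileHashRepository repository;

    ProcessedStatusResolver(FileHashRepository repository) {
        this.repository = repository;
    }

    ProcessedStatus resolve(String filePath, String contentHash, String metadataHash) {
        return repository.findByFilePath(filePath)
                .map(stored -> {
                    boolean unchanged = stored.contentHash().equals(contentHash)
                            && stored.metadataHash().equals(metadataHash);
                    // Same hashes -> already processed; different hashes -> changed.
                    return unchanged ? ProcessedStatus.PROCESSED : ProcessedStatus.CHANGED;
                })
                // No entry for this file path yet -> the file is new.
                .orElse(ProcessedStatus.NEW);
    }
}
```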

File Processing

File Processed Status

The processed status of a retrieved file is determined by comparing the hashed content and metadata of the file against an existing entry in the database with the same file path.

The maximum number of retrieved files that can have their content and metadata hashed concurrently can be configured using the following parameter: ipf.file-poller.pollers[N].file-processing-parallelism (where N is the index of the specified polling job, starting from 0). The default value is 128.
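
Purely as an illustration of what this parallelism setting controls (a sketch, not the actual implementation), the example below caps the number of files hashed at the same time with a fixed-size thread pool. BoundedParallelHasher, HashedFile, and the hashing function parameter are hypothetical.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BoundedParallelHasher {

    // Corresponds to ipf.file-poller.pollers[N].file-processing-parallelism (default 128).
    private final ExecutorService pool;

    BoundedParallelHasher(int fileProcessingParallelism) {
        this.pool = Executors.newFixedThreadPool(fileProcessingParallelism);
    }

    // Hypothetical result type: a retrieved file plus its content and metadata hashes.
    record HashedFile(Path path, String contentHash, String metadataHash) {}

    List<Future<HashedFile>> hashAll(List<Path> retrievedFiles, Function<Path, HashedFile> hashFn) {
        List<Future<HashedFile>> futures = new ArrayList<>();
        for (Path path : retrievedFiles) {
            // At most fileProcessingParallelism hashing tasks run at the same time.
            futures.add(pool.submit(() -> hashFn.apply(path)));
        }
        return futures;
    }
}
```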

The buffer size controlling the maximum batch of retrieved files (after hashing) for processed status determination can be configured using the following parameter: ipf.file-poller.pollers[N].file-processing-buffer. This determines how many files are sent per database query and the total number of database queries performed per polling cycle. The default value is 500.
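
The effect of this buffer on database access can be sketched as follows: the hashed files are split into batches of at most the configured size, and one lookup query is issued per batch, so a polling cycle performs roughly ceil(fileCount / bufferSize) queries. The BatchedStatusLookup class below is a hypothetical illustration of that batching, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

class BatchedStatusLookup {

    // Corresponds to ipf.file-poller.pollers[N].file-processing-buffer (default 500).
    private final int bufferSize;

    BatchedStatusLookup(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Each batch becomes one database query, so a polling cycle issues
    // ceil(hashedFiles.size() / bufferSize) processed-status lookups.
    <T> List<List<T>> batches(List<T> hashedFiles) {
        List<List<T>> result = new ArrayList<>();
        for (int from = 0; from < hashedFiles.size(); from += bufferSize) {
            int to = Math.min(from + bufferSize, hashedFiles.size());
            result.add(hashedFiles.subList(from, to));
        }
        return result;
    }
}
```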

Hashing

File content and metadata are hashed using the MD5 message-digest algorithm. Hashing of file content is performed using a buffered input stream with a buffer size (in bytes) that can be configured using the following parameter: ipf.file-poller.pollers[N].file-content-hash-buffer-bytes. The default value is 8192 (8KB).
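
A minimal sketch of buffered MD5 content hashing along these lines is shown below; ContentHasher and md5Hex are illustrative names, not the actual implementation.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class ContentHasher {

    // Corresponds to ipf.file-poller.pollers[N].file-content-hash-buffer-bytes (default 8192).
    private final int bufferBytes;

    ContentHasher(int bufferBytes) {
        this.bufferBytes = bufferBytes;
    }

    String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), bufferBytes)) {
            byte[] chunk = new byte[bufferBytes];
            int read;
            // Stream the file through the digest in fixed-size chunks.
            while ((read = in.read(chunk)) != -1) {
                md5.update(chunk, 0, read);
            }
        }
        return HexFormat.of().formatHex(md5.digest());
    }
}
```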

Changed File Job Re-schedule Policy

The job execution handler will handle CHANGED files based on the configured re-schedule policy. The three available policy options are:

  • NEVER: Do not re-schedule jobs for changed files.

  • ALWAYS: Always re-schedule jobs for changed files, regardless of the current job execution status of the file. This is more lightweight than the IGNORE_TRIGGERED option, but may create duplicate jobs for the same file.

  • IGNORE_TRIGGERED: Do not re-schedule jobs if the previous job execution status is TRIGGERED or the previous job cannot be found (as it’s not possible to determine whether the previous job has already been executed).

The changed file re-schedule policy can be configured using the following parameter: ipf.file-poller.pollers[N].changed-file-job-reschedule-policy. The default value is ALWAYS.
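
A minimal sketch of how the three policies could map to a re-schedule decision is shown below. Only the TRIGGERED status is named in this document; the other JobStatus values and the RescheduleDecision class are hypothetical, and an empty Optional stands for a previous job that cannot be found.

```java
import java.util.Optional;

enum ReschedulePolicy { NEVER, ALWAYS, IGNORE_TRIGGERED }

class RescheduleDecision {

    // Hypothetical execution statuses for the previously scheduled job;
    // only TRIGGERED is named in this document.
    enum JobStatus { SCHEDULED, TRIGGERED, COMPLETED }

    static boolean shouldReschedule(ReschedulePolicy policy, Optional<JobStatus> previousJobStatus) {
        return switch (policy) {
            case NEVER -> false;
            case ALWAYS -> true;
            // Re-schedule only when the previous job is known and is not TRIGGERED,
            // since otherwise it cannot be determined whether that job already executed.
            case IGNORE_TRIGGERED -> previousJobStatus
                    .filter(status -> status != JobStatus.TRIGGERED)
                    .isPresent();
        };
    }
}
```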