File Poller Scheduled Job Execution Handler
The job execution handler is the component that executes the jobs scheduled by the file poller scheduler. It is responsible for:
- Retrieving files from the configured directory using the specified filename patterns
- Determining each file's processed status (already processed, new, or changed since the last poll) and scheduling jobs for new files and, depending on configuration, changed files
- Executing each scheduled job using the execution logic defined in the FilePollerAdapter implementation
Determining the processed status of a file is done by hashing its content and metadata after retrieval, then comparing these hashes against previously stored values in the database for the same file path. For new or changed files, the updated hash information is stored in the database at the end of the job execution.
| File Path Already Exists in the Database? | Content and Metadata Changed? | File Processed Status | Action |
|---|---|---|---|
| No | N/A | New | Schedule a new job for the file |
| Yes | Yes | Changed | Re-schedule the current job for the file based on the configured re-schedule policy (see Changed File Job Re-schedule Policy section below) |
| Yes | No | Already processed | Ignore the file as it has been processed previously |
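The decision table above can be sketched as follows. This is a hypothetical illustration of the comparison logic, not the actual handler API; the class and method names are assumptions.

```java
import java.util.Optional;

public class ProcessedStatusDecision {

    // The three processed statuses described in the table above
    enum Status { NEW, CHANGED, ALREADY_PROCESSED }

    // storedHash: the hash previously persisted in the database for this
    // file path, if any; currentHash: the hash computed on this poll
    static Status determine(Optional<String> storedHash, String currentHash) {
        if (storedHash.isEmpty()) {
            return Status.NEW;                  // no DB entry: schedule a new job
        }
        return storedHash.get().equals(currentHash)
                ? Status.ALREADY_PROCESSED      // unchanged: ignore the file
                : Status.CHANGED;               // changed: apply the re-schedule policy
    }

    public static void main(String[] args) {
        System.out.println(determine(Optional.empty(), "abc"));    // NEW
        System.out.println(determine(Optional.of("abc"), "abc"));  // ALREADY_PROCESSED
        System.out.println(determine(Optional.of("abc"), "xyz"));  // CHANGED
    }
}
```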
File Processing
File Processed Status
The processed status of a retrieved file is determined by comparing the hashed content and metadata of the file against an existing entry in the database with the same file path.
The maximum number of retrieved files that can have their content and metadata hashed concurrently can be configured using the following parameter: ipf.file-poller.pollers[N].file-processing-parallelism (where N is the index of the specified polling job, starting from 0). The default value is 128.
The buffer size controlling the maximum batch of retrieved files (after hashing) for processed status determination can be configured using the following parameter: ipf.file-poller.pollers[N].file-processing-buffer. This determines how many files are sent per database query and the total number of database queries performed per polling cycle. The default value is 500.
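Assuming the parameters are set through a Spring-style application.yml (the exact file layout depends on your deployment), a configuration fragment for the first polling job (index 0) might look like:

```yaml
# Hypothetical configuration fragment; values shown are the documented defaults.
ipf:
  file-poller:
    pollers:
      - file-processing-parallelism: 128  # max files hashed concurrently
        file-processing-buffer: 500       # max files per processed-status DB query
```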
Hashing
File content and metadata are hashed using the MD5 message-digest algorithm. Hashing of file content is performed using a buffered input stream with a buffer size (in bytes) that can be configured using the following parameter: ipf.file-poller.pollers[N].file-content-hash-buffer-bytes. The default value is 8192 (8KB).
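A minimal sketch of the content-hashing step described above, using the JDK's MessageDigest and a BufferedInputStream. The class name and the fixed 8192-byte buffer are illustrative assumptions; in the poller the buffer size comes from ipf.file-poller.pollers[N].file-content-hash-buffer-bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class FileContentHasher {

    // Mirrors the default of file-content-hash-buffer-bytes (8192 = 8KB)
    private static final int BUFFER_BYTES = 8192;

    // Streams the input through an MD5 digest using a buffered read,
    // so large files are hashed without loading them fully into memory
    static String md5Hex(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (BufferedInputStream bis = new BufferedInputStream(in, BUFFER_BYTES)) {
            byte[] buf = new byte[BUFFER_BYTES];
            int n;
            while ((n = bis.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());  // 32-char lowercase hex
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex(new ByteArrayInputStream("hello".getBytes())));
    }
}
```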
Changed File Job Re-schedule Policy
The job execution handler will handle CHANGED files based on the configured re-schedule policy. The three available policy options are:
- NEVER: Do not re-schedule jobs for changed files.
- ALWAYS: Always re-schedule jobs for changed files, regardless of the current job execution status of the file. This is more lightweight than the IGNORE_TRIGGERED option, but may create duplicate jobs for the same file.
- IGNORE_TRIGGERED: Do not re-schedule jobs if the previous job execution status is TRIGGERED or the previous job cannot be found (as it is not possible to determine whether the previous job has already been executed).
The changed file re-schedule policy can be configured using the following parameter: ipf.file-poller.pollers[N].changed-file-job-reschedule-policy. The default value is ALWAYS.
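The three policy options can be sketched as a decision function. This is a hypothetical illustration, not the handler's actual code; only the enum constant names follow the documented configuration values.

```java
public class ReschedulePolicyDemo {

    // Constants match the documented values of changed-file-job-reschedule-policy
    enum ChangedFileJobReschedulePolicy { NEVER, ALWAYS, IGNORE_TRIGGERED }

    // Simplified job statuses for illustration
    enum JobStatus { TRIGGERED, COMPLETED }

    // previousStatus is null when the previous job cannot be found
    static boolean shouldReschedule(ChangedFileJobReschedulePolicy policy,
                                    JobStatus previousStatus) {
        switch (policy) {
            case NEVER:
                return false;  // never re-schedule changed files
            case ALWAYS:
                return true;   // always re-schedule, may duplicate jobs
            case IGNORE_TRIGGERED:
                // skip when the previous job is TRIGGERED or unknown,
                // since it may already have been executed
                return previousStatus != null && previousStatus != JobStatus.TRIGGERED;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldReschedule(
                ChangedFileJobReschedulePolicy.ALWAYS, null));                    // true
        System.out.println(shouldReschedule(
                ChangedFileJobReschedulePolicy.IGNORE_TRIGGERED, null));          // false
        System.out.println(shouldReschedule(
                ChangedFileJobReschedulePolicy.IGNORE_TRIGGERED, JobStatus.COMPLETED)); // true
    }
}
```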