Backend scanner

The scanner kicks off OcrMyPDF on documents placed into the corresponding directory.

Flow description

The program roughly executes as follows:

  1. The database connection is established.
  2. The base class is instantiated to provide output and other supporting functionality.
  3. The program then goes into /data/scan (a local folder must be mapped to this path via the docker run command) and looks for *.pdf files.
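
As an illustration of step 3, the directory scan could look roughly like the following. This is a minimal sketch, not the scanner's actual code: the function name find_pdfs is hypothetical, and it assumes that only files directly inside /data/scan are considered, so that PDFs already moved into subfolders are not picked up again.

```python
from pathlib import Path
from typing import List

SCAN_DIR = Path("/data/scan")  # directory named in the flow description


def find_pdfs(scan_dir: Path = SCAN_DIR) -> List[Path]:
    """Return the PDF files directly inside scan_dir, sorted by name.

    Hypothetical helper: the real scanner may select files differently.
    """
    # Non-recursive on purpose (assumption), so files sitting in
    # subfolders of scan_dir are ignored.
    return sorted(p for p in scan_dir.glob("*.pdf") if p.is_file())
```

Note that the *.pdf pattern is matched case-sensitively on Linux, so a file named SCAN.PDF would be skipped by this sketch.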

For each PDF found, the following flow is executed:

  1. A lock file (/tmp/ppyrdOcrMyPdf.txt) is checked. If it does not exist, it is created to ensure that we don't have concurrent OcrMyPDF processes working on the same large file.
  2. Tesseract is started. The output is written to /data/inbox.
  3. We check whether the output has been written.
  4. If not, the original (without OCR) is moved to /data/scan/error.
  5. If it has, the original is moved to /data/scan/archive.
  6. If the lock from step 1 already exists, steps 2 to 5 are skipped and a simple message is printed stating that we are waiting.
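
The per-file flow above can be sketched as follows. This is an illustration under stated assumptions, not the scanner's real implementation: the names run_ocr, acquire_lock and process_pdf are hypothetical, OCR is assumed to be invoked through the ocrmypdf command line with an input and an output path, and the lock is assumed to be removed once a file has been handled (the description does not say when it is released). The lock is created atomically so that two processes cannot both pass the check.

```python
import os
import shutil
import subprocess
from pathlib import Path

LOCK_FILE = Path("/tmp/ppyrdOcrMyPdf.txt")  # lock path from the description


def run_ocr(src: Path, dst: Path) -> None:
    # Assumption: OCR is started as "ocrmypdf <input> <output>".
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=False)


def acquire_lock(lock_file: Path) -> bool:
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def process_pdf(pdf: Path, inbox: Path, scan_dir: Path,
                lock_file: Path = LOCK_FILE, ocr=run_ocr) -> str:
    """Run one PDF through the flow; return 'waiting', 'archived' or 'error'."""
    if not acquire_lock(lock_file):
        # Step 6: another run holds the lock, so only report and skip.
        print(f"Lock {lock_file} exists, waiting ...")
        return "waiting"
    try:
        output = inbox / pdf.name
        try:
            ocr(pdf, output)  # step 2
        except OSError:
            pass  # e.g. OCR tool not installed; handled as missing output
        # Steps 3 to 5: pick the target folder based on the output file.
        status = "archived" if output.exists() else "error"
        target = scan_dir / ("archive" if status == "archived" else "error")
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))
        return status
    finally:
        # Assumption: the lock is released after each file (Python 3.8+).
        lock_file.unlink(missing_ok=True)
```

Passing the OCR step in as the ocr parameter is a design choice of this sketch: it lets the lock and file-moving logic be exercised without OcrMyPDF installed, for example by supplying a function that simply copies the input to the output path.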