Opened 9 years ago

Last modified 9 years ago

#47574 new request

port request: 'tabula' and 'tabula-extractor'

Reported by: KurtPfeifle (Kurt Pfeifle)
Owned by: macports-tickets@…
Priority: Normal
Milestone:
Component: ports
Version:
Keywords:
Cc:
Port:

Description

The self-description of the Tabula project is quite telling and appropriate:

"Tabula is a tool for liberating data tables trapped inside PDF files."

Here is the link to the sources:


Extracting tables from PDF pages into a usable spreadsheet format is extremely difficult. Here is some background information:

Given the scope of this task, Tabula works extremely well.

The Tabula family of tools is written in Ruby. In the background, they use PDFBox (which is written in Java) and a few other third-party libraries. Running the command-line tool tabula, which is hosted in the Tabula-Extractor repository, requires JRuby 1.7 to be installed. My JRuby is the MacPorts version.

I have successfully run tabula directly from the cloned Git repository:

    mkdir ~/svn-stuff
    cd ~/svn-stuff
    git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

The Git clone already includes the required libraries, so there is no need to install PDFBox separately. The command-line tool is in the bin/ subdirectory.

Exploring the command line options:

    ~/svn-stuff/git.tabula-extractor/bin/tabula -h
    
    Tabula helps you extract tables from PDFs
    
    Usage:
           tabula [options] <pdf_file>
    where [options] are:
             --pages, -p <s>:   Comma separated list of ranges, or all. Examples:
                                --pages 1-3,5-7, --pages 3 or --pages all. Default
                                is --pages 1 (default: 1)
              --area, -a <s>:   Portion of the page to analyze
                                (top,left,bottom,right). Example: --area
                                269.875,12.75,790.5,561. Default is entire page
           --columns, -c <s>:   X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3
          --password, -s <s>:   Password to decrypt document. Default is empty
                                (default: )
                 --guess, -g:   Guess the portion of the page to analyze per page.
                 --debug, -d:   Print detected table areas instead of processing.
            --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
           --outfile, -o <s>:   Write output to <file> instead of STDOUT (default:
                                -)
           --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines separating
                                each cell, as in a PDF of an Excel spreadsheet)
        --no-spreadsheet, -n:   Force PDF not to be extracted using
                                spreadsheet-style extraction (if there are ruling
                                lines separating each cell, as in a PDF of an Excel
                                spreadsheet)
                --silent, -i:   Suppress all stderr output.
      --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
               --version, -v:   Print version and exit
                  --help, -h:   Show this message
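
As an illustration of how these options combine, here is a minimal sketch that only prints a tabula command line (a dry run) rather than executing it. It uses only options listed in the help text above. The helper name tabula_cmd and the file names report.pdf and report.csv are hypothetical, and the default path assumes the clone location used earlier.

```shell
#!/bin/sh
# Assumed location of the tabula script (the clone made earlier in this ticket);
# override by setting TABULA in the environment.
TABULA="${TABULA:-$HOME/svn-stuff/git.tabula-extractor/bin/tabula}"

# Hypothetical helper: print the tabula command line for one PDF.
# Arguments: <pdf_file> <pages> <format> <outfile>
# Note: arguments are not shell-quoted, so avoid paths containing spaces.
tabula_cmd() {
    printf '%s --pages %s --format %s --outfile %s %s\n' \
        "$TABULA" "$2" "$3" "$4" "$1"
}

# Dry run: show the command for pages 1-3 and 5-7, written as CSV.
tabula_cmd report.pdf 1-3,5-7 CSV report.csv
```

Running the printed line by hand (with a real PDF in place of report.pdf) would perform the actual extraction.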

Change History (1)

comment:1 Changed 9 years ago by mf2k (Frank Schima)

Keywords: PDF table csv tsv spreadsheet removed