Private personal pet project: downloader for "Jornal de Negócios"

bot “bb_jnegocios1_dl_edition.php”

Purpose
This is a tool to download a single date-specific edition of “Jornal de Negócios” (a very good newspaper on business and markets, focused on the Portuguese context), as available from
https://quiosque.cofina.pt/jornal-de-negocios/
or
https://quiosque.cofina.pt/jornal-de-negocios/yyyymmdd

For example:
https://quiosque.cofina.pt/jornal-de-negocios/20210729
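
The yyyymmdd suffix is simply the zero-padded year, month and day. A minimal sketch of how such a URL could be built follows; the function name `buildEditionUrl` and the constant are hypothetical, not part of the actual bot, and only the base URL comes from the addresses above.

```php
<?php
declare(strict_types=1);

// Base URL as shown above; the constant name is an assumption for this sketch.
const QUIOSQUE_JN_BASE_URL = "https://quiosque.cofina.pt/jornal-de-negocios/";

// Hypothetical helper: returns the edition URL for a given calendar date.
function buildEditionUrl(int $pY, int $pM, int $pD): string
{
    // checkdate() takes month, day, year (in that order)
    if (!checkdate($pM, $pD, $pY)) {
        throw new InvalidArgumentException("Not a valid calendar date: $pY-$pM-$pD");
    }
    // yyyymmdd: year padded to 4 digits, month and day padded to 2 digits
    return QUIOSQUE_JN_BASE_URL . sprintf("%04d%02d%02d", $pY, $pM, $pD);
}

echo buildEditionUrl(2021, 7, 29), PHP_EOL;
// prints https://quiosque.cofina.pt/jornal-de-negocios/20210729
```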

This content is only available to subscribers. I am a longtime subscriber, but I very much prefer to have all the content offline and compiled together, to read whenever I want, regardless of internet connectivity. Publishers usually do NOT provide this level of control over their content, so I have to write my own tools. This post is a glimpse of one of them.

Related projects of my own

This depends on my AmConsole class to handle the user’s command-line arguments, using a pattern of my own that enforces a certain discipline for default values, validations and descriptions.
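
To give an idea of that discipline without showing AmConsole itself, here is a minimal sketch of the general pattern: every positional argument gets a default, a validator and a description. The function `resolveArguments` and its return shape are hypothetical stand-ins, not the real AmConsole API.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the idea behind AmConsole (not its real API).
function resolveArguments(array $pArgv, array $pDefaults, array $pValidators, array $pDescriptions): array
{
    $values = [];
    $errors = [];
    foreach ($pDefaults as $index => $default) {
        // a user-supplied value wins; otherwise fall back to the default
        $value = array_key_exists($index, $pArgv) ? $pArgv[$index] : $default;
        $validator = $pValidators[$index] ?? null;
        if ($validator !== null && !$validator($value)) {
            $errors[$index] = "Argument #$index rejected. " . ($pDescriptions[$index] ?? "");
        }
        $values[$index] = $value;
    }
    return ["values" => $values, "errors" => $errors];
}

// Accepts integers >= 1, given either as int (defaults) or as digit-only string (argv).
$isIntegerGTOE1 = function ($pValue): bool {
    if (is_int($pValue)) {
        return $pValue >= 1;
    }
    return is_string($pValue) && ctype_digit($pValue) && intval($pValue) >= 1;
};

$defaults = [1 => 2021, 2 => 7, 3 => 29, 4 => 4444]; // fixed values, only for this demo
$validators = [1 => $isIntegerGTOE1, 2 => $isIntegerGTOE1, 3 => $isIntegerGTOE1, 4 => $isIntegerGTOE1];
$descriptions = [1 => "Year, integer >= 1.", 2 => "Month, integer >= 1.", 3 => "Day, integer >= 1.", 4 => "Driver port, integer >= 1."];

// The user gave year and month only; day and port come from the defaults.
$result = resolveArguments(["script.php", "2020", "12"], $defaults, $validators, $descriptions);
print_r($result);
```

Keeping defaults, validators and descriptions in parallel arrays keyed by the argv index means a help text and an error report can be generated from the same data that drives the parsing.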

The most important code in the project, by far, is class “QuiosqueCofinaPT”, which does all the impersonation jobs: login, browse to edition, flip the pages, save snapshots, etc.
That class “QuiosqueCofinaPT” depends on another lower-level class of mine named “AmWebDriver”, which directly interfaces with a running Selenium hub instance, which in turn controls a running web browser. Firefox (ESR) is the browser in use.
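
For readers unfamiliar with how a script talks to a Selenium hub: it is plain HTTP with JSON bodies, as defined by the W3C WebDriver protocol. The sketch below only builds (and does not send) a "New Session" request asking for Firefox. It is not taken from AmWebDriver; the function name, the host and the `/wd/hub` base path are assumptions, and the base path in particular is typical of older Selenium hubs and may differ in other setups.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: describes the HTTP request that opens a browser session on a hub.
function buildNewSessionRequest(int $pPort = 4444, string $pHost = "localhost", string $pBasePath = "/wd/hub"): array
{
    // W3C WebDriver "New Session" payload, requesting a Firefox browser
    $payload = [
        "capabilities" => [
            "alwaysMatch" => ["browserName" => "firefox"],
        ],
    ];
    return [
        "method" => "POST",
        "url" => "http://" . $pHost . ":" . $pPort . rtrim($pBasePath, "/") . "/session",
        "headers" => ["Content-Type: application/json; charset=utf-8"],
        "body" => json_encode($payload),
    ];
}

$request = buildNewSessionRequest(4444);
echo $request["method"], " ", $request["url"], PHP_EOL;
echo $request["body"], PHP_EOL;
```

Sending the request (with cURL or any HTTP client) would return a session id, and every later command, such as navigating or taking a screenshot, is another HTTP call that includes that id.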

There is an external Scrivener book file “bot_jornaldenegocios.scrivx” which captures the evolution of the project that resulted in this new bot for Blogbot.

Example calls

php bb_jnegocios1_dl_edition.php 2021 7 29 4444 # all possible arguments given
php bb_jnegocios1_dl_edition.php 2021 7 29 # omits the Selenium driver port, defaults to 4444
php bb_jnegocios1_dl_edition.php 2021 7 # omits the day and the port, defaults to current day and port 4444
php bb_jnegocios1_dl_edition.php 2021 # omits the month, the day and the port; defaults to current month and day, port 4444
php bb_jnegocios1_dl_edition.php # omits everything, defaults to current date and port 4444

Source code (of the dl script only, not of the supporting classes)

<?php
require_once  "./vendor/autoload.php";

use am\util\AmDate;
use am\internet\HttpHelper;
use am\internet\QuiosqueCofinaPT;
use am\console\Console;

define ("THIS_HARVESTER_NAME", "BB 'quiosque.cofina.pt/jornal-de-negocios/' [from] daily edition harvester".PHP_EOL);
define ("THIS_HARVESTER_VERSION", "v20210728 2000".PHP_EOL);

echo THIS_HARVESTER_NAME;
echo THIS_HARVESTER_VERSION;

const MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE = 0;
const ARGUMENT_YEAR_INDEX_IN_ARGV = 1;
const ARGUMENT_MONTH_INDEX_IN_ARGV = 2;
const ARGUMENT_DAY_INDEX_IN_ARGV = 3;
const ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV = 4;

const DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT = \am\internet\AmChromeDriver::SELENIUM_HUB_DEFAULT_SERVER_PORT; //default for selenium (do not confuse with chromedriver.exe 9515 port)

// to use AmConsole, one must provide a validation function per possible argument
// in this case, all args can be validated by the same function 'validateIsIntegerGTOE1'
$arrayOfValidationFunctions = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DAY_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "validateIsIntegerGTOE1"
];

// to use AmConsole, one must describe every possible argument
$arrayOfDescriptorsOneForEachCommandLineArg = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "Integer >=1 can be supplied, for year (defaults to system's year).",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "Integer >=1 can be supplied, for month (defaults to system's month).",
    ARGUMENT_DAY_INDEX_IN_ARGV => "Integer >=1 can be supplied, for day (defaults to system's day).",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "Integer >=1 expected, for driver port (defaults to 4444).",
];

// to use AmConsole, one must provide default values for every possible argument that the user can omit
$strCurrentDate = date("Y-m-d");
$aCurrentDate = explode("-", $strCurrentDate);
$iYear = intval($aCurrentDate[0]);
$iMonth = intval($aCurrentDate[1]);
$iDay = intval($aCurrentDate[2]);
$arrayOfDefaultValues = [
    0 => __FILE__ //always like this, to state this very same script as one argument
    ,
    ARGUMENT_YEAR_INDEX_IN_ARGV => $iYear
    ,
    ARGUMENT_MONTH_INDEX_IN_ARGV => $iMonth
    ,
    ARGUMENT_DAY_INDEX_IN_ARGV => $iDay
    ,
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT
];

//-------------------- VALIDATORS START --------------------

function validateIsIntegerGTOE1 (
    $pInt
) : bool
{
    $iResult = \am\util\Util::toInteger($pInt);
    return $iResult ? $iResult>=1 : false;
}//validateIsIntegerGTOE1

//-------------------- VALIDATORS END --------------------

//----------- ACTION (PROBLEM SPECIFIC) STARTS------------
function action(
    $pConsole
){
    $y = intval($pConsole->mArgv[ARGUMENT_YEAR_INDEX_IN_ARGV]); //if the values that populated the mArgv object are user supplied they'll be strings
    $m = intval($pConsole->mArgv[ARGUMENT_MONTH_INDEX_IN_ARGV]);
    $d = intval($pConsole->mArgv[ARGUMENT_DAY_INDEX_IN_ARGV]);
    $driverPort = intval($pConsole->mArgv[ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV]);

    $bValidDate = \am\util\DateTools::validDay($y, $m, $d);
    if ($bValidDate){
        echo "Valid date received. Will now download the JN publications.".PHP_EOL;
        /*
         * these secrets can be captured on the PHP LOG FILE!
         * TODO: how to avoid this security risk?
         * https://websec.io/2018/06/14/Keep-Credentials-Secure.html
         */
        $o = new QuiosqueCofinaPT(
            SECRET_QUIOSQUE_COFINA_LOGIN_NAME_1,
            SECRET_QUIOSQUE_COFINA_PASSWORD_1,

            $driverPort,
            HttpHelper::USER_AGENT_STRING_CHROME_70
        );
        $loginRet = $o->actionLogin();
        $startDate = new AmDate($y, $m, $d);
        $bIsSunday = $startDate->isSunday();

        if (!$bIsSunday){
            $o->browseDailyEditionAndSnapshotSaveAllPairsOfPages(
                $startDate->mY,
                $startDate->mM,
                $startDate->mD,
                "dls"
            );
        }//if NOT sunday
    }//if valid date
    else{
        echo "Call aborted - please supply a valid date!".PHP_EOL;
    }//else
}//action

//----------- ACTION (PROBLEM SPECIFIC) ENDS------------

/*
 * the __construct constructor of AmConsole throws an Exception when no command line arguments (including no script name) are received
 * PHPSTORM will signal a warning of "unhandled Exception" for a call without try/catch
 */
try {
    $oConsole = new \am\console\AmConsole(
        $argv,
        $pMinNumberOfArguments = MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE,
        $arrayOfDefaultValues,
        $arrayOfValidationFunctions,
        $arrayOfDescriptorsOneForEachCommandLineArg
    );
}//try
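catch (Exception $e){
    echo $e->getMessage().PHP_EOL;
    exit(1); //without a console object there is nothing left to do; $oConsole would be undefined below
}//catch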

echo $oConsole; //a summary of everything received
$c0 = $oConsole->allArgsOK();
if ($c0) action (
    $oConsole
);
else{
    echo "Did NOT call the script, because 1+ argument(s) was not OK.".PHP_EOL;
}
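
The decision made inside action() (reject invalid dates, skip Sundays, otherwise download) relies on my own DateTools and AmDate classes, which are not shown. As a rough equivalent using only PHP built-ins, the same decision could be sketched as follows; the function name and the three return labels are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical helper mirroring the branching in action(), with built-ins only.
function describeEditionDate(int $pY, int $pM, int $pD): string
{
    // checkdate() takes month, day, year (in that order)
    if (!checkdate($pM, $pD, $pY)) {
        return "invalid";
    }
    $date = (new DateTimeImmutable())->setDate($pY, $pM, $pD);
    // format "N" is the ISO-8601 day of the week: "1" (Monday) to "7" (Sunday)
    return $date->format("N") === "7" ? "sunday-skip" : "download";
}

echo describeEditionDate(2021, 7, 29), PHP_EOL; // a Thursday
echo describeEditionDate(2021, 8, 1), PHP_EOL;  // a Sunday
echo describeEditionDate(2021, 2, 30), PHP_EOL; // not a calendar date
```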

Results
In the end, this bot produces files in an automatically created folder, containing snapshots of the pages. Other tools will OCR and compile the contents together.
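
One loose end from the listing is the TODO about secrets ending up in the PHP log. A possible direction, sketched under assumptions and not what the bot currently does: read the credentials from environment variables and, on PHP 8.2 or later, mark the password parameter with the built-in `#[\SensitiveParameter]` attribute so that stack traces show a placeholder instead of the value. The names `readSecretFromEnv`, `loginSketch` and `QUIOSQUE_DEMO_PASSWORD` are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: the secret lives in the environment, not in the source code.
function readSecretFromEnv(string $pName): string
{
    $value = getenv($pName);
    if ($value === false || $value === "") {
        throw new RuntimeException("Missing environment variable: $pName");
    }
    return $value;
}

// Stand-in for a login routine that fails, so that a stack trace is produced.
// The attribute sits on its own line; PHP 8.2+ honours it, and PHP 7 reads the line as a comment.
function loginSketch(
    string $pUser,
    #[\SensitiveParameter]
    string $pPassword
): void {
    throw new RuntimeException("Simulated login failure for $pUser");
}

putenv("QUIOSQUE_DEMO_PASSWORD=not-a-real-password"); // demo only; normally set in the shell
$trace = "";
try {
    loginSketch("demo-user", readSecretFromEnv("QUIOSQUE_DEMO_PASSWORD"));
} catch (RuntimeException $e) {
    $trace = $e->getTraceAsString();
    echo $e->getMessage(), PHP_EOL;
    echo $trace, PHP_EOL; // on PHP >= 8.2 the password argument is replaced by a placeholder object
}
```

On older PHP versions the attribute has no effect, so there the php.ini setting `zend.exception_ignore_args` (which drops all arguments from traces) may be worth looking at instead.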