bot “bb_jnegocios1_dl_edition.php”
Purpose
This is a tool to download a single date-specific edition of “Jornal de Negocios” – a very good newspaper on business and markets, focused on the Portuguese context – as available from
https://quiosque.cofina.pt/jornal-de-negocios/
or
https://quiosque.cofina.pt/jornal-de-negocios/yyyymmdd
For example:
https://quiosque.cofina.pt/jornal-de-negocios/20210729
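The URL pattern above can be sketched in PHP as follows. The function name `buildEditionUrl` is hypothetical and is not part of the QuiosqueCofinaPT class; the only thing taken from the site is the yyyymmdd suffix.

```php
<?php
// Hypothetical helper (not part of the original project): builds the
// edition URL for a given date, following the yyyymmdd pattern shown above.
function buildEditionUrl(int $year, int $month, int $day): string
{
    $base = "https://quiosque.cofina.pt/jornal-de-negocios/";
    // Zero-pad month and day so that 2021-7-29 becomes 20210729
    return $base . sprintf("%04d%02d%02d", $year, $month, $day);
}

echo buildEditionUrl(2021, 7, 29), PHP_EOL;
```

The zero-padding matters because the command line accepts plain integers such as 7, while the site expects 07.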
This content is only available to subscribers. I am a longtime subscriber, but I very much prefer to have all the content offline and compiled together, so I can read it whenever I want, regardless of internet connection availability. Publishers usually do NOT provide this level of control over the contents, so I have to write my own tools. This post is a glimpse of one of those tools.
Related projects of my own
This depends on my AmConsole class, which handles the user’s command-line arguments using a pattern of my own that enforces a certain discipline for default values, validations and descriptions.
The most important code in the project, by far, is class “QuiosqueCofinaPT”, which does all the impersonation jobs: login, browse to edition, flip the pages, save snapshots, etc.
That class “QuiosqueCofinaPT” depends on another lower-level class of mine named “AmWebDriver”, which directly interfaces with a running instance of a Selenium hub, which in turn controls a running web browser. The browser in use is Firefox (ESR).
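Since everything depends on a Selenium hub that is already running, a quick pre-flight check can save a confusing failure later. The sketch below is not taken from AmWebDriver; `isPortOpen` is a hypothetical helper that only tests whether something accepts TCP connections on the given host and port (4444 being the default used by this script). It does not verify that the listener is actually Selenium, and it assumes the hub runs on the same machine.

```php
<?php
// Hypothetical helper, not part of AmWebDriver: returns true when a TCP
// connection to $host:$port succeeds within $timeoutSeconds.
function isPortOpen(string $host, int $port, float $timeoutSeconds = 2.0): bool
{
    $errorCode = 0;
    $errorMessage = "";
    // @ suppresses the warning fsockopen emits when the connection fails
    $socket = @fsockopen($host, $port, $errorCode, $errorMessage, $timeoutSeconds);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

// Assumption: the hub is local and uses the script's default port
if (!isPortOpen("127.0.0.1", 4444)) {
    echo "Nothing is listening on port 4444 - is the Selenium hub running?", PHP_EOL;
}
```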
There is an external Scrivener book file “bot_jornaldenegocios.scrivx” which captures the evolution of the project that resulted in this new bot for Blogbot.
Example calls
php bb_jnegocios1_dl_edition 2021 7 29 4444 #all possible arguments given
php bb_jnegocios1_dl_edition 2021 7 29 #omits the Selenium driver port, defaults to 4444
php bb_jnegocios1_dl_edition 2021 7 #omits the day and the port, defaults to current day and port 4444
php bb_jnegocios1_dl_edition 2021 #omits the month, the day and the port; defaults to current month and day, port 4444
php bb_jnegocios1_dl_edition #omits everything, defaults to current date and port 4444
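The fallback behaviour in these calls can be sketched as below. This is not the AmConsole implementation (its source is not shown here); `resolveArguments` is a hypothetical name, and the sketch only assumes that positional arguments the user omits are taken from an array of defaults with the same indexes, and that every argument after the script name must be an integer >= 1.

```php
<?php
// Hypothetical sketch, NOT the real AmConsole: fills omitted positional
// arguments from defaults and validates each one as an integer >= 1.
function resolveArguments(array $argv, array $defaults): array
{
    $resolved = [];
    foreach ($defaults as $index => $default) {
        // User-supplied values win; otherwise fall back to the default
        $value = $argv[$index] ?? $default;
        if ($index > 0) {
            // Index 0 is the script name; all other arguments must be digits only
            if (!ctype_digit((string) $value) || (int) $value < 1) {
                throw new InvalidArgumentException("Argument #$index must be an integer >= 1.");
            }
            $value = (int) $value;
        }
        $resolved[$index] = $value;
    }
    return $resolved;
}

// Equivalent to: php bb_jnegocios1_dl_edition 2021 7
$defaults = [0 => "script.php", 1 => (int) date("Y"), 2 => (int) date("n"), 3 => (int) date("j"), 4 => 4444];
$args = resolveArguments(["script.php", "2021", "7"], $defaults);
echo "{$args[1]}-{$args[2]}-{$args[3]} port {$args[4]}", PHP_EOL;
```

Here the year and month come from the user, while the day and the port fall back to the current day and 4444.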
Source code (of the dl script only, not of the supporting classes)
<?php
require_once "./vendor/autoload.php";

use am\util\AmDate;
use am\internet\HttpHelper;
use am\internet\QuiosqueCofinaPT;
use am\console\Console;

define ("THIS_HARVESTER_NAME", "BB 'quiosque.cofina.pt/jornal-de-negocios/' [from] daily edition harvester".PHP_EOL);
define ("THIS_HARVESTER_VERSION", "v20210728 2000".PHP_EOL);

echo THIS_HARVESTER_NAME;
echo THIS_HARVESTER_VERSION;

const MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE = 0;

const ARGUMENT_YEAR_INDEX_IN_ARGV = 1;
const ARGUMENT_MONTH_INDEX_IN_ARGV = 2;
const ARGUMENT_DAY_INDEX_IN_ARGV = 3;
const ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV = 4;

//default for selenium (do not confuse with chromedriver.exe 9515 port)
const DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT = \am\internet\AmChromeDriver::SELENIUM_HUB_DEFAULT_SERVER_PORT;

// to use AmConsole, one must provide a validation function per possible argument
// in this case, all args can be validated by the same function 'validateIsIntegerGTOE1'
$arrayOfValidationFunctions = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DAY_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "validateIsIntegerGTOE1"
];

// to use AmConsole, one must describe every possible argument
$arrayOfDescriptorsOneForEachCommandLineArg = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "Integer >=1 can be supplied, for year (defaults to system's year).",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "Integer >=1 can be supplied, for month (defaults to system's month).",
    ARGUMENT_DAY_INDEX_IN_ARGV => "Integer >=1 can be supplied, for day (defaults to system's day).",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "Integer >=1 expected, for driver port (defaults to 4444).",
];

// to use AmConsole, one must provide default values for every possible argument that the user can omit
$strCurrentDate = date("Y-m-d");
$aCurrentDate = explode("-", $strCurrentDate);
$iYear = intval($aCurrentDate[0]);
$iMonth = intval($aCurrentDate[1]);
$iDay = intval($aCurrentDate[2]);

$arrayOfDefaultValues = [
    0 => __FILE__ //always like this, to state this very same script as one argument
    , ARGUMENT_YEAR_INDEX_IN_ARGV => $iYear
    , ARGUMENT_MONTH_INDEX_IN_ARGV => $iMonth
    , ARGUMENT_DAY_INDEX_IN_ARGV => $iDay
    , ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT
];

//-------------------- VALIDATORS START --------------------
function validateIsIntegerGTOE1 (
    $pInt
) : bool
{
    $iResult = \am\util\Util::toInteger($pInt);
    return $iResult ? $iResult>=1 : false;
}//validateIsIntegerGTOE1
//-------------------- VALIDATORS END --------------------

//----------- ACTION (PROBLEM SPECIFIC) STARTS------------
function action(
    $pConsole
){
    //if the values that populated the mArgv object are user supplied they'll be strings
    $y = intval($pConsole->mArgv[ARGUMENT_YEAR_INDEX_IN_ARGV]);
    $m = intval($pConsole->mArgv[ARGUMENT_MONTH_INDEX_IN_ARGV]);
    $d = intval($pConsole->mArgv[ARGUMENT_DAY_INDEX_IN_ARGV]);
    $driverPort = intval($pConsole->mArgv[ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV]);

    $bValidDate = \am\util\DateTools::validDay($y, $m, $d);

    if ($bValidDate){
        echo "Valid date received. Will now download the JN publications.".PHP_EOL;

        /*
         * these secrets can be captured on the PHP LOG FILE!
         * TODO: how to avoid this security risk?
         * https://websec.io/2018/06/14/Keep-Credentials-Secure.html
         */
        $o = new QuiosqueCofinaPT(
            SECRET_QUIOSQUE_COFINA_LOGIN_NAME_1,
            SECRET_QUIOSQUE_COFINA_PASSWORD_1,
            $driverPort,
            HttpHelper::USER_AGENT_STRING_CHROME_70
        );

        $loginRet = $o->actionLogin();

        $startDate = new AmDate($y, $m, $d);
        $bIsSunday = $startDate->isSunday();

        //no download is attempted for Sundays
        if (!$bIsSunday){
            $o->browseDailyEditionAndSnapshotSaveAllPairsOfPages(
                $startDate->mY,
                $startDate->mM,
                $startDate->mD,
                "dls"
            );
        }//if NOT sunday
    }//if valid date
    else{
        echo "Call aborted - please supply a valid date!".PHP_EOL;
    }//else
}//action
//----------- ACTION (PROBLEM SPECIFIC) ENDS------------

/*
 * the __construct constructor of AmConsole throws an Exception when no command line arguments (including no script name) are received
 * PHPSTORM will signal a warning of "unhandled Exception" for a call without try/catch
 */
try {
    $oConsole = new \am\console\AmConsole(
        $argv,
        $pMinNumberOfArguments = MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE,
        $arrayOfDefaultValues,
        $arrayOfValidationFunctions,
        $arrayOfDescriptorsOneForEachCommandLineArg
    );
}//try
catch (Exception $e){
    echo $e->getMessage().PHP_EOL;
    exit(1); //without a console object there is nothing left to do; $oConsole would be undefined below
}//catch

echo $oConsole; //a summary of everything received

$c0 = $oConsole->allArgsOK();

if ($c0)
    action ( $oConsole );
else{
    echo "Did NOT call the script, because 1+ argument(s) was not OK.".PHP_EOL;
}
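The script validates the date and skips Sundays through DateTools::validDay and AmDate::isSunday, whose source is not shown in this post. As a rough equivalent using only PHP built-ins (an assumption about what those helpers do, not their actual code), the same two checks could look like this; `shouldDownload` is a hypothetical name.

```php
<?php
// Hypothetical stand-in for the DateTools::validDay + AmDate::isSunday checks,
// using only PHP built-ins. Returns true when the date exists and is not a Sunday.
function shouldDownload(int $year, int $month, int $day): bool
{
    // checkdate() takes month, day, year (in that order)
    if (!checkdate($month, $day, $year)) {
        return false;
    }
    $date = new DateTimeImmutable(sprintf("%04d-%02d-%02d", $year, $month, $day));
    // Format "N" is the ISO-8601 day of week: 1 = Monday ... 7 = Sunday
    return $date->format("N") !== "7";
}

var_dump(shouldDownload(2021, 7, 29)); // a Thursday
var_dump(shouldDownload(2021, 8, 1));  // a Sunday
var_dump(shouldDownload(2021, 2, 30)); // not a real date
```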
Results
In the end, this bot produces files in an automatically created folder, containing snapshots of the pages. Other tools will OCR and compile the contents together.