bot “bb_jnegocios1_dl_edition.php”
Purpose
This is a tool to download a single date-specific edition of “Jornal de Negocios” – a very good newspaper on business and markets, focused on the Portuguese context – as available from
https://quiosque.cofina.pt/jornal-de-negocios/
or
https://quiosque.cofina.pt/jornal-de-negocios/yyyymmdd
For example:
https://quiosque.cofina.pt/jornal-de-negocios/20210729
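As a minimal sketch of the URL pattern above, an edition address can be derived from a year, month and day by zero-padding the parts into `yyyymmdd`. The function name `buildEditionUrl` is hypothetical and is not part of the classes described in this post.

```php
<?php
// Hypothetical helper (not part of QuiosqueCofinaPT): builds the address
// of a date-specific edition following the yyyymmdd pattern shown above.
function buildEditionUrl(int $year, int $month, int $day): string
{
    $base = "https://quiosque.cofina.pt/jornal-de-negocios/";
    // %04d / %02d zero-pad the parts, so month 7 becomes "07"
    return $base . sprintf("%04d%02d%02d", $year, $month, $day);
}

echo buildEditionUrl(2021, 7, 29), PHP_EOL;
// prints https://quiosque.cofina.pt/jornal-de-negocios/20210729
```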
This content is only available to subscribers. I am a longtime subscriber, but I much prefer to have all the content offline and compiled together, so that I can read it whenever I want, regardless of internet availability. Publishers usually do NOT provide this level of control over their content, so I have to write my own tools. This post is a glimpse of one of those tools.
Related projects of my own
This depends on my AmConsole class to handle the user’s command-line arguments, using a pattern of my own that enforces a certain discipline for default values, validations and descriptions.
The most important code in the project, by far, is the class “QuiosqueCofinaPT”, which does all the impersonation work: logging in, browsing to the edition, flipping the pages, saving snapshots, etc.
“QuiosqueCofinaPT” in turn depends on a lower-level class of mine named “AmWebDriver”, which talks directly to a running Selenium hub instance, which then controls a running web browser. The browser in use is Firefox (ESR).
There is an external Scrivener book file “bot_jornaldenegocios.scrivx” which captures the evolution of the project that resulted in this new bot for Blogbot.
Example calls
php bb_jnegocios1_dl_edition.php 2021 7 29 4444 #all possible arguments given
php bb_jnegocios1_dl_edition.php 2021 7 29 #omits the Selenium driver port, defaults to 4444
php bb_jnegocios1_dl_edition.php 2021 7 #omits the day and the port; defaults to the current day and port 4444
php bb_jnegocios1_dl_edition.php 2021 #omits the month, the day and the port; defaults to the current month and day, port 4444
php bb_jnegocios1_dl_edition.php #omits everything; defaults to the current date and port 4444
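The way omitted trailing arguments fall back to defaults can be sketched with PHP's array union operator. This is only an illustration: `fillMissingArgs` is a hypothetical name, and the real AmConsole class (not shown in this post) may well work differently.

```php
<?php
// Hypothetical stand-in for the default handling done by AmConsole.
function fillMissingArgs(array $argv, array $defaults): array
{
    // The + operator keeps every entry of the left array and only adds
    // the keys of the right array that are missing on the left.
    $merged = $argv + $defaults;
    ksort($merged); // keep the positional order 0, 1, 2, ...
    return $merged;
}

// Defaults keyed by position in argv: year, month, day, Selenium port.
$defaults = [1 => (int) date("Y"), 2 => (int) date("n"), 3 => (int) date("j"), 4 => 4444];

// Simulates a call that supplies only the year and the month.
$args = fillMissingArgs(["bb_jnegocios1_dl_edition.php", "2021", "7"], $defaults);
print_r($args);
```

User-supplied values stay strings ("2021", "7") while the defaults are integers, which is why the script converts everything with intval() before use.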
Source code (of the dl script only, not of the supporting classes)
<?php
require_once "./vendor/autoload.php";
use am\util\AmDate;
use am\internet\HttpHelper;
use am\internet\QuiosqueCofinaPT;
use am\console\Console;
define ("THIS_HARVESTER_NAME", "BB 'quiosque.cofina.pt/jornal-de-negocios/' [from] daily edition harvester".PHP_EOL);
define ("THIS_HARVESTER_VERSION", "v20210728 2000".PHP_EOL);
echo THIS_HARVESTER_NAME;
echo THIS_HARVESTER_VERSION;
const MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE = 0;
const ARGUMENT_YEAR_INDEX_IN_ARGV = 1;
const ARGUMENT_MONTH_INDEX_IN_ARGV = 2;
const ARGUMENT_DAY_INDEX_IN_ARGV = 3;
const ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV = 4;
const DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT = \am\internet\AmChromeDriver::SELENIUM_HUB_DEFAULT_SERVER_PORT; //default for selenium (do not confuse with chromedriver.exe 9515 port)
// to use AmConsole, one must provide a validation function per possible argument
// in this case, all args can be validated by the same function 'validateIsIntegerGTOE1'
$arrayOfValidationFunctions = [
ARGUMENT_YEAR_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
ARGUMENT_MONTH_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
ARGUMENT_DAY_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "validateIsIntegerGTOE1"
];
// to use AmConsole, one must describe every possible argument
$arrayOfDescriptorsOneForEachCommandLineArg = [
ARGUMENT_YEAR_INDEX_IN_ARGV => "Integer >=1 can be supplied, for year (defaults to system's year).",
ARGUMENT_MONTH_INDEX_IN_ARGV => "Integer >=1 can be supplied, for month (defaults to system's month).",
ARGUMENT_DAY_INDEX_IN_ARGV => "Integer >=1 can be supplied, for day (defaults to system's day).",
ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "Integer >=1 expected, for driver port (defaults to 4444).",
];
// to use AmConsole, one must provide default values for every possible argument that the user can omit
$strCurrentDate = date("Y-m-d");
$aCurrentDate = explode("-", $strCurrentDate);
$iYear = intval($aCurrentDate[0]);
$iMonth = intval($aCurrentDate[1]);
$iDay = intval($aCurrentDate[2]);
$arrayOfDefaultValues = [
0 => __FILE__ //always like this, to state this very same script as one argument
,
ARGUMENT_YEAR_INDEX_IN_ARGV => $iYear
,
ARGUMENT_MONTH_INDEX_IN_ARGV => $iMonth
,
ARGUMENT_DAY_INDEX_IN_ARGV => $iDay
,
ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT
];
//-------------------- VALIDATORS START --------------------
function validateIsIntegerGTOE1 (
$pInt
) : bool
{
$iResult = \am\util\Util::toInteger($pInt);
return $iResult ? $iResult>=1 : false;
}//validateIsIntegerGTOE1
//-------------------- VALIDATORS END --------------------
//----------- ACTION (PROBLEM SPECIFIC) STARTS------------
function action(
$pConsole
){
$y = intval($pConsole->mArgv[ARGUMENT_YEAR_INDEX_IN_ARGV]); //if the values that populated the mArgv object are user supplied they'll be strings
$m = intval($pConsole->mArgv[ARGUMENT_MONTH_INDEX_IN_ARGV]);
$d = intval($pConsole->mArgv[ARGUMENT_DAY_INDEX_IN_ARGV]);
$driverPort = intval($pConsole->mArgv[ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV]);
$bValidDate = \am\util\DateTools::validDay($y, $m, $d);
if ($bValidDate){
echo "Valid date received. Will now download the JN publications.".PHP_EOL;
/*
* these secrets can be captured on the PHP LOG FILE!
* TODO: how to avoid this security risk?
* https://websec.io/2018/06/14/Keep-Credentials-Secure.html
*/
$o = new QuiosqueCofinaPT(
SECRET_QUIOSQUE_COFINA_LOGIN_NAME_1,
SECRET_QUIOSQUE_COFINA_PASSWORD_1,
$driverPort,
HttpHelper::USER_AGENT_STRING_CHROME_70
);
$loginRet = $o->actionLogin();
$startDate = new AmDate($y, $m, $d);
$bIsSunday = $startDate->isSunday();
if (!$bIsSunday){
$o->browseDailyEditionAndSnapshotSaveAllPairsOfPages(
$startDate->mY,
$startDate->mM,
$startDate->mD,
"dls"
);
}//if NOT sunday
}//if valid date
else{
echo "Call aborted - please supply a valid date!".PHP_EOL;
}//else
}//action
//----------- ACTION (PROBLEM SPECIFIC) ENDS------------
/*
* the __construct constructor of AmConsole throws an Exception when no command line arguments (including no script name) are received
* PHPSTORM will signal a warning of "unhandled Exception" for a call without try/catch
*/
try {
$oConsole = new \am\console\AmConsole(
$argv,
$pMinNumberOfArguments = MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE,
$arrayOfDefaultValues,
$arrayOfValidationFunctions,
$arrayOfDescriptorsOneForEachCommandLineArg
);
}//try
catch (Exception $e){
echo $e->getMessage().PHP_EOL;
exit(1); //without a console object, $oConsole is undefined and the code below would fail
}//catch
echo $oConsole; //a summary of everything received
$c0 = $oConsole->allArgsOK();
if ($c0) action (
$oConsole
);
else{
echo "Did NOT call the script, because 1+ argument(s) was not OK.".PHP_EOL;
}
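The two date checks used in action() rely on helper classes that are not listed here (\am\util\DateTools and AmDate). As a rough sketch of what such checks could look like with PHP built-ins only, the functions below use hypothetical names and are not those classes' actual implementation.

```php
<?php
// Hypothetical helpers; the script itself uses \am\util\DateTools and AmDate.
function isValidCalendarDate(int $y, int $m, int $d): bool
{
    // checkdate() expects month, day, year (in that order)
    return checkdate($m, $d, $y);
}

function isSunday(int $y, int $m, int $d): bool
{
    $date = (new DateTimeImmutable())->setDate($y, $m, $d);
    // format("N") is the ISO-8601 weekday: "1" = Monday ... "7" = Sunday
    return $date->format("N") === "7";
}

var_dump(isValidCalendarDate(2021, 2, 30)); // bool(false): February has no day 30
var_dump(isSunday(2021, 8, 1));             // bool(true)
var_dump(isSunday(2021, 7, 29));            // bool(false): a Thursday
```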
Results
In the end, this bot produces files in an automatically created folder, containing snapshots of the pages. Other tools will OCR and compile the contents together.