Tag Archives: php

php (post_tag auto created by Wordpresser)

Private personal pet project: downloader for "Jornal de Negócios"

bot “bb_jnegocios1_dl_edition.php”

Purpose
This is a tool to download a single date-specific edition of “Jornal de Negocios” – a very good newspaper on business and markets, focused on the Portuguese context – as available from
https://quiosque.cofina.pt/jornal-de-negocios/
or
https://quiosque.cofina.pt/jornal-de-negocios/yyyymmdd

For example:
https://quiosque.cofina.pt/jornal-de-negocios/20210729

This is content only available to subscribers. I am a longtime subscriber, but I very much prefer to have all the content offline and compiled together, for me to consume whenever I want, regardless of internet connection availability. Publishers usually do NOT provide this level of control over the contents, so I have to write my own tools. This post is a glimpse on one of the tools.

Related projects of my own

This depends on my AmConsole class, to handle the user’s command-line arguments, using a pattern of my-own for forcing a certain discipline for default values, validations and descriptions.

The most important code in the project, by far, is class “QuiosqueCofinaPT”, which does all the impersonation jobs: login, browse to edition, flip the pages, save snapshots, etc.
That class “QuiosqueCofinaPT” depends on another lower-level class of mine named “AmWebDriver”, which directly interfaces with a running instance of Selenium hub, which then controls a running web-browser. Firefox (ESR) is the version in use.

There is an external Scrivener book file “bot_jornaldenegocios.scrivx” which captures the evolution of the project that resulted in this new bot for Blogbot.

Example calls

php  bb_jnegocios1_dl_edition 2021 7 29 4444 #all possible arguments given
php  bb_jnegocios1_dl_edition 2021 7 29 #omits the Selenium driver port, defaults to 4444
php  bb_jnegocios1_dl_edition 2021 7 #omits the day and the port, defaults to current day and port 4444
php  bb_jnegocios1_dl_edition 2021 #omits the month, the day and the port; defaults to current month and day, port 444
php  bb_jnegocios1_dl_edition #omits everything, default to current date and port 4444

Source code (of the dl script only, not of the supporting classes)

<?php
require_once  "./vendor/autoload.php";

use am\util\AmDate;
use am\internet\HttpHelper;
use am\internet\QuiosqueCofinaPT;
use am\console\Console;

define ("THIS_HARVESTER_NAME", "BB 'quiosque.cofina.pt/jornal-de-negocios/' &#91;from&#93; daily edition harvester".PHP_EOL);
define ("THIS_HARVESTER_VERSION", "v20210728 2000".PHP_EOL);

echo THIS_HARVESTER_NAME;
echo THIS_HARVESTER_VERSION;

const MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE = 0;
const ARGUMENT_YEAR_INDEX_IN_ARGV = 1;
const ARGUMENT_MONTH_INDEX_IN_ARGV = 2;
const ARGUMENT_DAY_INDEX_IN_ARGV = 3;
const ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV = 4;

const DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT = \am\internet\AmChromeDriver::SELENIUM_HUB_DEFAULT_SERVER_PORT; //default for selenium (do not confuse with chromedriver.exe 9515 port)

// to use AmConsole, one must provide a validation function per possible argument
// in this case, all args can be validated by the same function 'validateIsIntegerGTOE1'
$arrayOfValidationFunctions = &#91;
    ARGUMENT_YEAR_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DAY_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "validateIsIntegerGTOE1"
];

// to use AmConsole, one must provide describe every possible argument
$arrayOfDescriptorsOneForEachCommandLineArg = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "Integer >=1 can be supplied, for year (defaults to system's year).",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "Integer >=1 can be supplied, for month (defaults to system's month).",
    ARGUMENT_DAY_INDEX_IN_ARGV => "Integer >=1 can be supplied, for day (defaults to system's day).",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "Integer >=1 expected, for driver port (defaults to 4444).",
];

// to use AmConsole, one must provide describe default values for every possible argument that the user can omit
$strCurrentDate = date("Y-m-d");
$aCurrentDate = explode("-", $strCurrentDate);
$iYear = intval($aCurrentDate[0]);
$iMonth = intval($aCurrentDate[1]);
$iDay = intval($aCurrentDate[2]);
$arrayOfDefaultValues = [
    0 => __FILE__ //always like this, to state this very same script as one argument
    ,
    ARGUMENT_YEAR_INDEX_IN_ARGV => $iYear
    ,
    ARGUMENT_MONTH_INDEX_IN_ARGV => $iMonth
    ,
    ARGUMENT_DAY_INDEX_IN_ARGV => $iDay
    ,
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT
];

//-------------------- VALIDATORS START --------------------

function validateIsIntegerGTOE1 (
    $pInt
) : bool
{
    $iResult = \am\util\Util::toInteger($pInt);
    return $iResult ? $iResult>=1 : false;
}//validateIsIntegerGTOE1

//-------------------- VALIDATORS END --------------------

//----------- ACTION (PROBLEM SPECIFIC) STARTS------------
function action(
    $pConsole
){
    $y = intval($pConsole->mArgv[ARGUMENT_YEAR_INDEX_IN_ARGV]); //if the values that populated the mArgv object are user supplied they'll be strings
    $m = intval($pConsole->mArgv[ARGUMENT_MONTH_INDEX_IN_ARGV]);
    $d = intval($pConsole->mArgv[ARGUMENT_DAY_INDEX_IN_ARGV]);
    $driverPort = intval($pConsole->mArgv[ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV]);

    $bValidDate = \am\util\DateTools::validDay($y, $m, $d);
    if ($bValidDate){
        echo "Valid date received. Will now download the JN publications.".PHP_EOL;
        /*
         * these secrets can be captured on the PHP LOG FILE!
         * TODO: how to avoid this security risk?
         * https://websec.io/2018/06/14/Keep-Credentials-Secure.html
         */
        $o = new QuiosqueCofinaPT(
            SECRET_QUIOSQUE_COFINA_LOGIN_NAME_1,
            SECRET_QUIOSQUE_COFINA_PASSWORD_1,

            $driverPort,
            HttpHelper::USER_AGENT_STRING_CHROME_70
        );
        $loginRet = $o->actionLogin();
        $startDate = new AmDate($y, $m, $d);
        $bIsSunday = $startDate->isSunday();

        if (!$bIsSunday){
            $o->browseDailyEditionAndSnapshotSaveAllPairsOfPages(
                $startDate->mY,
                $startDate->mM,
                $startDate->mD,
                "dls"
            );
        }//if NOT sunday
    }//if valid date
    else{
        echo "Call aborted - please supply a valid date!".PHP_EOL;
    }//else
}//action

//----------- ACTION (PROBLEM SPECIFIC) ENDS------------

/*
 * the __construct constructor of AmConsole throws an Exception when no command line arguments (including no script name) are received
 * PHPSTORM will signal a warning of "unhandled Exception" for the a call without try/catch
 */
try {
    $oConsole = new \am\console\AmConsole(
        $argv,
        $pMinNumberOfArguments = MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE,
        $arrayOfDefaultValues,
        $arrayOfValidationFunctions,
        $arrayOfDescriptorsOneForEachCommandLineArg
    );
}//try
catch (Exception $e){
    echo $e->getMessage();
}//catch

echo $oConsole; //a summary of everything received
$c0 = $oConsole->allArgsOK();
if ($c0) action (
    $oConsole
);
else{
    echo "Did NOT call the script, because 1+ argument(s) was not OK.".PHP_EOL;
}

Results
In the end, this bot produces files in an automatically created folder, containing snapshots of the pages. Other tools will OCR and compile the contents together.

File system organization automation – moving thousands of files

I am trying to organize large thousands of files.
The task is not healthy if done by hand; moreover, it is prone to human error. So, I wrote an automated solution, using PHP and some of its file system calls.

The problem is described in the main function comments, here shared in this post. Imagine thousands of folders, each eventually containing a single specially named sub-folder (e.g. “_”) whose contents (files and directories) you want to move to its parent root. This was interesting to write because it shows the power of a language’s file system tools for automation in files organization.

Check the “before” and “after” images for the simplest example I managed to capture in a visual representation.

/*
 * 2019-07-04 started
 * 2019-07-04 first tested ok - test preserved in file "test_20190704.php"
 * learned: https://www.php.net/manual/en/class.splfileinfo.php
 * the idea is to move the contents in <some folder>\_\<contents> to <some folder>\<contents>
 * the "_" is the product of a rename "shortener" tool - it facilitates in avoiding very long paths that result from some jobs
 * however, some archives when unpacked will still have sub-folders inside the _ folder
 * and that organization creates problems to other tools of mine
 * this function should received a folder F path, of a folder which might contain a "_" named sub folder
 * then it will move _'s files and folders to the F's root
 *
 * @param string $pStrFolderPath : some folder path
 * @param string $pStrSpecificNameOfSubFolderWhoseContentsAreToBeMovedToParentFolder : optional, the name of the special folder, defaults to "_"
 * @param bool $pbFeedback : verbose activity, giving feedback to user? Defaults to true
 * @param bool $pbCaseSensitive : case sensitive in checking the folder? Default to false and is not relevant for folders with names like "_"
 * @return int : the number of successful ren/move operations
 */
function moveEverythingInFolderWithSpecificNameIfItExistsInPathToItsParentFolder (
    string $pStrFolderPath,
    string $pStrSpecificNameOfSubFolderWhoseContentsAreToBeMovedToParentFolder = "_",
    $pbFeedback=true,
    $pbCaseSensitive = false
)
{
    $bPathExists=file_exists($pStrFolderPath);
    $iSuccessfulRenamesCounter=$iFailedRenames=0;

    if ($pbFeedback){
        feedbackOnFunctionCall(
            __FUNCTION__,
            func_get_args()
        );
    }//if

    if ($bPathExists) {
        //dirsInDir does NOT return the dots dirs . and ..
        $dirsInDir = dirsInDir(
            $pStrFolderPath,
            false //not recursive
        );
        $iHowManyDirsInDir = count($dirsInDir);
        $bRightNumberOfDirs = ($iHowManyDirsInDir === 1); // $pStrSpecificNameOfSubFolderWhoseContentsAreToBeMovedToParentFolder

        //there should be only 1 dir if the move op is to be safe in preserving the contents' hierarchy
        if ($bRightNumberOfDirs) {
            $oSingleDirItemObject = $dirsInDir[0];
            $strDirName = $oSingleDirItemObject["fname"];

            $bFolderHasTheSearchedForName =
                $pbCaseSensitive ?
                    //sensitive
                    strcmp(
                        $strDirName,
                        $pStrSpecificNameOfSubFolderWhoseContentsAreToBeMovedToParentFolder
                    ) === 0

                    :
                    //insensitive
                    strcasecmp(
                        $strDirName,
                        $pStrSpecificNameOfSubFolderWhoseContentsAreToBeMovedToParentFolder
                    ) === 0;

            $oParentFolderPath = new SplFileInfo ($pStrFolderPath);
            $strParentDirectoryFullPath = $oParentFolderPath->getRealPath();

            if ($bFolderHasTheSearchedForName) {
                $strOldLocationOfSearchedFolder = $oSingleDirItemObject["realPath"];

                //dirItems is a function of mine, returning an assoc array representing each dir in a path, except the dot dirs (. and ..)
                $aItemsInSearchedFolder = dirItems(
                    $strOldLocationOfSearchedFolder,
                    "*",
                    false //not recursive
                );

                foreach ($aItemsInSearchedFolder as $itemAsAssocArray) {
                    $strOldLocationOfSearchedFolder = $itemAsAssocArray["realPath"];
                    $strItemName = $itemAsAssocArray["fname"];
                    $strNewLocationForSearchedFolderContents = $strParentDirectoryFullPath . "\\" . $strItemName;

                    $bRenameMoveItemOK = rename(
                        $strOldLocationOfSearchedFolder,
                        $strNewLocationForSearchedFolderContents
                    );
                    if ($bRenameMoveItemOK) {
                        $iSuccessfulRenamesCounter++;
                    }//if ok
                    else {
                        echo "Failed in moving $strOldLocationOfSearchedFolder to $strNewLocationForSearchedFolderContents" . PHP_EOL;
                        $iFailedRenames++;
                    }//else
                }//foreach item in folder

                $strFullPathOfSearchedFolder = $oSingleDirItemObject['realPath'];
                $aItemsInSearchedFolder = dirItems($strFullPathOfSearchedFolder);
                $bEmptyFolder = count($aItemsInSearchedFolder)===0;
                if ($bEmptyFolder){
                    $rmResult = rmdir($strFullPathOfSearchedFolder); //return not used, but it is a bool
                }//if
            }//if found the searched folder
        }//if there is exactly 1 folder in the path
    }//if path exists

    return $iSuccessfulRenamesCounter;
}//moveEverythingInFolderWithSpecificNameIfItExistsInPathToItsParentFolder



mover_after.png
https://arturmarques.com/wp/wp-content/uploads/2019/07/mover_after.png (image/png)

mover_after.png


mover_before.png
https://arturmarques.com/wp/wp-content/uploads/2019/07/mover_before.png (image/png)

mover_before.png

Technical Details

A quick solution to check if a string ends with…

Today I was coding a PHP solution to remove tracking data from URLs. To help in doing so, my approach to the problem requires being able to check if a string ends in another. PHP doesn’t seem to have a prebuilt “endsWith” function.
So, I wrote an auxiliary tool to do just that (check if a string ends in another), dumped it as static method to an “Util” class of mine (being static, no need to instantiate the class to use it), and gave it the bonus of being able to do the checking in a case sensitive fashion, or not.

    //_.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-^-._.-
    /*
     * 2019-07-03
     * @param string $pStr : a string
     * @param string $pStrTermination : a string to be tested as a termination of the first
     * @return bool : true if indeed $pStrTermination is at the end of $pStr
     */
    public static function auxCheckIfStringsEndsWith(
        string $pStr, //some string
        string $pStrTermination, //some string that eventually is at the end of $pStr
        bool $pbCaseSensitive = false //by default, don't be case sensitive in checking
    ){
        $iLengthOfString = strlen($pStr);
        $iLengthOfTermination = strlen($pStrTermination);
        $bItCanBeATermination = $iLengthOfString>=$iLengthOfTermination;

        if ($bItCanBeATermination){
            $iPosWhereEventualTerminationStarts =
                $pbCaseSensitive ?
                strpos($pStr, $pStrTermination) //case sensitive checking
                :
                stripos($pStr, $pStrTermination); //case insensitive checking

            if ($iPosWhereEventualTerminationStarts!==false){
                //termination exists, but is it at the exact end of string?
                $bAtTheExactEnd = $iPosWhereEventualTerminationStarts === $iLengthOfString - $iLengthOfTermination;
                return $bAtTheExactEnd;
            }//if
        }//if
        return false;
    }//auxCheckIfStringsEndsWith