Private personal pet project: downloader for "Jornal de Negócios"

bot “bb_jnegocios1_dl_edition.php”

Purpose
This is a tool to download a single date-specific edition of “Jornal de Negócios” – a very good newspaper on business and markets, focused on the Portuguese context – as available from
https://quiosque.cofina.pt/jornal-de-negocios/
or
https://quiosque.cofina.pt/jornal-de-negocios/yyyymmdd

For example:
https://quiosque.cofina.pt/jornal-de-negocios/20210729
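
The yyyymmdd pattern above is simple to reproduce. What follows is only a minimal Python sketch of it, not the code of my PHP classes: the helper name edition_url is hypothetical, and it assumes the URL scheme stays exactly as shown.

```python
from datetime import date

# Base URL as given in this post; the helper name below is hypothetical.
BASE_URL = "https://quiosque.cofina.pt/jornal-de-negocios/"


def edition_url(year: int, month: int, day: int) -> str:
    """Build the URL of one date-specific edition.

    date() raises ValueError for impossible dates (e.g. February 30),
    so an invalid date never produces a URL.
    """
    edition_date = date(year, month, day)
    return BASE_URL + edition_date.strftime("%Y%m%d")


print(edition_url(2021, 7, 29))
```

Zero-padding of month and day is handled by strftime, so 2021, 7, 29 becomes 20210729.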

This is content only available to subscribers. I am a longtime subscriber, but I very much prefer to have all the content offline and compiled together, to consume whenever I want, regardless of internet connection availability. Publishers usually do NOT provide this level of control over the contents, so I have to write my own tools. This post is a glimpse of one of those tools.

Related projects of my own

This depends on my AmConsole class, which handles the user’s command-line arguments, using a pattern of my own that enforces a certain discipline for default values, validations and descriptions.

The most important code in the project, by far, is class “QuiosqueCofinaPT”, which does all the impersonation jobs: login, browse to edition, flip the pages, save snapshots, etc.
That class “QuiosqueCofinaPT” depends on another lower-level class of mine named “AmWebDriver”, which directly interfaces with a running instance of Selenium hub, which then controls a running web browser. Firefox (ESR) is the browser in use.

There is an external Scrivener book file “bot_jornaldenegocios.scrivx” which captures the evolution of the project that resulted in this new bot for Blogbot.

Example calls

php bb_jnegocios1_dl_edition.php 2021 7 29 4444 #all possible arguments given
php bb_jnegocios1_dl_edition.php 2021 7 29 #omits the Selenium driver port, defaults to 4444
php bb_jnegocios1_dl_edition.php 2021 7 #omits the day and the port, defaults to current day and port 4444
php bb_jnegocios1_dl_edition.php 2021 #omits the month, the day and the port; defaults to current month and day, port 4444
php bb_jnegocios1_dl_edition.php #omits everything, defaults to current date and port 4444
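
The rule behind the calls above is that any omitted trailing argument falls back to a default. As an illustration only, here is a minimal Python sketch of that rule; it is not how AmConsole is implemented, and the names resolve_args and DEFAULT_PORT are hypothetical.

```python
from datetime import date

DEFAULT_PORT = 4444  # default Selenium hub port, as stated in this post


def resolve_args(argv, today=None):
    """Return (year, month, day, port), filling omitted trailing values.

    argv[0] is the script name, as in PHP's $argv; up to four integers
    may follow, in the order year, month, day, port.
    """
    today = today or date.today()
    defaults = [today.year, today.month, today.day, DEFAULT_PORT]
    supplied = [int(text) for text in argv[1:5]]  # user values arrive as strings
    if any(value < 1 for value in supplied):
        raise ValueError("every argument must be an integer >= 1")
    # Keep what the user supplied, then take the remaining defaults.
    return tuple(supplied + defaults[len(supplied):])


print(resolve_args(["script", "2021", "7"], today=date(2021, 7, 29)))
```

Passing today explicitly makes the function easy to test; in normal use it is left out and the system date applies.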

Source code (of the dl script only, not of the supporting classes)

<?php
require_once  "./vendor/autoload.php";

use am\util\AmDate;
use am\internet\HttpHelper;
use am\internet\QuiosqueCofinaPT;
use am\console\Console;

define ("THIS_HARVESTER_NAME", "BB 'quiosque.cofina.pt/jornal-de-negocios/' [from] daily edition harvester".PHP_EOL);
define ("THIS_HARVESTER_VERSION", "v20210728 2000".PHP_EOL);

echo THIS_HARVESTER_NAME;
echo THIS_HARVESTER_VERSION;

const MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE = 0;
const ARGUMENT_YEAR_INDEX_IN_ARGV = 1;
const ARGUMENT_MONTH_INDEX_IN_ARGV = 2;
const ARGUMENT_DAY_INDEX_IN_ARGV = 3;
const ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV = 4;

const DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT = \am\internet\AmChromeDriver::SELENIUM_HUB_DEFAULT_SERVER_PORT; //default for selenium (do not confuse with chromedriver.exe 9515 port)

// to use AmConsole, one must provide a validation function per possible argument
// in this case, all args can be validated by the same function 'validateIsIntegerGTOE1'
$arrayOfValidationFunctions = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DAY_INDEX_IN_ARGV => "validateIsIntegerGTOE1",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "validateIsIntegerGTOE1"
];

// to use AmConsole, one must describe every possible argument
$arrayOfDescriptorsOneForEachCommandLineArg = [
    ARGUMENT_YEAR_INDEX_IN_ARGV => "Integer >=1 can be supplied, for year (defaults to system's year).",
    ARGUMENT_MONTH_INDEX_IN_ARGV => "Integer >=1 can be supplied, for month (defaults to system's month).",
    ARGUMENT_DAY_INDEX_IN_ARGV => "Integer >=1 can be supplied, for day (defaults to system's day).",
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => "Integer >=1 expected, for driver port (defaults to 4444).",
];

// to use AmConsole, one must provide default values for every possible argument that the user can omit
$strCurrentDate = date("Y-m-d");
$aCurrentDate = explode("-", $strCurrentDate);
$iYear = intval($aCurrentDate[0]);
$iMonth = intval($aCurrentDate[1]);
$iDay = intval($aCurrentDate[2]);
$arrayOfDefaultValues = [
    0 => __FILE__ //always like this, to state this very same script as one argument
    ,
    ARGUMENT_YEAR_INDEX_IN_ARGV => $iYear
    ,
    ARGUMENT_MONTH_INDEX_IN_ARGV => $iMonth
    ,
    ARGUMENT_DAY_INDEX_IN_ARGV => $iDay
    ,
    ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV => DEFAULT_VALUE_FOR_ARGUMENT_DRIVER_PORT
];

//-------------------- VALIDATORS START --------------------

function validateIsIntegerGTOE1 (
    $pInt
) : bool
{
    $iResult = \am\util\Util::toInteger($pInt);
    return $iResult ? $iResult>=1 : false;
}//validateIsIntegerGTOE1

//-------------------- VALIDATORS END --------------------

//----------- ACTION (PROBLEM SPECIFIC) STARTS------------
function action(
    $pConsole
){
    $y = intval($pConsole->mArgv[ARGUMENT_YEAR_INDEX_IN_ARGV]); //if the values that populated the mArgv object are user supplied they'll be strings
    $m = intval($pConsole->mArgv[ARGUMENT_MONTH_INDEX_IN_ARGV]);
    $d = intval($pConsole->mArgv[ARGUMENT_DAY_INDEX_IN_ARGV]);
    $driverPort = intval($pConsole->mArgv[ARGUMENT_DRIVER_PORT_INDEX_IN_ARGV]);

    $bValidDate = \am\util\DateTools::validDay($y, $m, $d);
    if ($bValidDate){
        echo "Valid date received. Will now download the JN publications.".PHP_EOL;
        /*
         * these secrets can be captured on the PHP LOG FILE!
         * TODO: how to avoid this security risk?
         * https://websec.io/2018/06/14/Keep-Credentials-Secure.html
         */
        $o = new QuiosqueCofinaPT(
            SECRET_QUIOSQUE_COFINA_LOGIN_NAME_1,
            SECRET_QUIOSQUE_COFINA_PASSWORD_1,

            $driverPort,
            HttpHelper::USER_AGENT_STRING_CHROME_70
        );
        $loginRet = $o->actionLogin();
        $startDate = new AmDate($y, $m, $d);
        $bIsSunday = $startDate->isSunday();

        if (!$bIsSunday){
            $o->browseDailyEditionAndSnapshotSaveAllPairsOfPages(
                $startDate->mY,
                $startDate->mM,
                $startDate->mD,
                "dls"
            );
        }//if NOT sunday
    }//if valid date
    else{
        echo "Call aborted - please supply a valid date!".PHP_EOL;
    }//else
}//action

//----------- ACTION (PROBLEM SPECIFIC) ENDS------------

/*
 * the __construct constructor of AmConsole throws an Exception when no command line arguments (including no script name) are received
 * PHPSTORM will signal a warning of "unhandled Exception" for a call without try/catch
 */
try {
    $oConsole = new \am\console\AmConsole(
        $argv,
        $pMinNumberOfArguments = MIN_NUMBER_OF_ARGUMENTS_THE_USER_MUST_PROVIDE,
        $arrayOfDefaultValues,
        $arrayOfValidationFunctions,
        $arrayOfDescriptorsOneForEachCommandLineArg
    );
}//try
catch (Exception $e){
    echo $e->getMessage().PHP_EOL;
    exit(1); //without a console object, the code below cannot run
}//catch

echo $oConsole; //a summary of everything received
$c0 = $oConsole->allArgsOK();
if ($c0) action (
    $oConsole
);
else{
    echo "Did NOT call the script, because 1+ argument(s) was not OK.".PHP_EOL;
}

Results
In the end, this bot produces files in an automatically created folder, containing snapshots of the pages. Other tools will OCR and compile the contents together.

How to download courses from Coursera, in 2021

To download the COURSERA.ORG courses one subscribes to, there are two options. Either one writes one’s own bot, which will have to solve the authentication challenge and be able to crawl, identify and fetch all the relevant course files; or one learns to use “COURSERA-DL”, a free and open source software (FOSS) project, mostly written in Python, available from:
https://github.com/coursera-dl/coursera-dl/

The first option is great for learning the corresponding skills, but it is hard work.

The second option is immediately available and is much more sensible for quick results, mainly for those who are only focused on getting the course materials, for offline studying.

This post is about installing and using COURSERA-DL. The post assumes “Python” is properly installed. The commands shown were tested on a Python installation on Windows 10.

To install or update COURSERA-DL, the following sequence of commands will work. Enter the commands from any command-line console (CMD.EXE on Windows). Even if COURSERA-DL is already installed, it will remain so, keeping its configuration, and it will only be updated. The commands go a bit beyond COURSERA-DL, because I also care about EDX courses.
One project similar to COURSERA-DL is EDX-DL, for courses at EDX.ORG. Both learning sites have materials on YOUTUBE.COM, so yet another related FOSS is YOUTUBE-DL.

python -m pip install --upgrade pip
pip install --upgrade coursera-dl
pip install --upgrade edx-dl
pip install --upgrade youtube-dl

Once these FOSS solutions are made available on the system, they can be called from the command-line.

To know the technical name of a COURSERA.ORG course, pay attention to its URL, when learning in a browser session. For example, when starting to learn the Coursera course named “Build a Modern Computer From First Principles”, the URL is
https://www.coursera.org/learn/build-a-computer/home/welcome

The technical name is “build-a-computer“, i.e., the string after “https://www.coursera.org/learn/” and before the subsequent forward-slash (“/”). This parsing rule should work for any course.
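
The parsing rule above can be sketched in a few lines of Python. This is only an illustration under the assumption that course URLs keep the “/learn/name/…” shape shown here; the function name course_slug is hypothetical and is not part of COURSERA-DL.

```python
from urllib.parse import urlparse


def course_slug(url: str) -> str:
    """Extract the course's technical name from a coursera.org course URL.

    Assumes the path looks like /learn/<technical-name>/..., as in the
    example from this post.
    """
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) < 2 or segments[0] != "learn":
        raise ValueError("not a /learn/<technical-name> URL: " + url)
    return segments[1]


print(course_slug("https://www.coursera.org/learn/build-a-computer/home/welcome"))
```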

To download a COURSERA.ORG course named “XPTO”, logging-in as “user@email.com”, having password “1234”, in theory, it should suffice to launch a command-line window (CMD.EXE on any Windows) and enter:

coursera-dl -u "user@email.com" -p "1234" "XPTO"

These days, this will probably FAIL, due to the introduction of CAPTCHAs, which defeat many bots.

As of February 2021, COURSERA-DL does NOT defeat the COURSERA CAPTCHA, which asks the user to pick the images that solve some challenge. Defeating CAPTCHAs can be quite a project on its own, so it is understandable that this is happening. The workaround is easy, but not automatable.

For each COURSERA.ORG course you are subscribed to, when you use a web browser to learn it, a cookie named “CAUTH” for domain “.coursera.org” is created on the local computer. In my case, I always use Firefox and the extension “cookie quick manager”, to see the cookies for domains. Using that extension, or equivalent, just observe, text-select, and copy the string value for the CAUTH cookie, which can be a long string (hundreds of chars).

Then, provide the value of that string upon calling COURSERA-DL:

coursera-dl -u "user@email.com" -p "1234" "XPTO" -ca "hundreds of chars go here"

That is it.
For a better workflow, find the folder where the Python script for coursera-dl is; i.e. search for the local file “coursera-dl.py“.

If you have Python installed at

c:\python

the file will be at

c:\python\scripts

In the scripts folder, create a NEW text file named “coursera.conf“, consisting of the sensitive data and other eventual arguments you can learn about by reading COURSERA-DL’s documentation.

For example:

-u "user@email.com" -p "1234" --subtitle-language en --download-quizzes

The text above is the content inside the text file “coursera.conf“, saved in the same folder that contains the coursera-dl.py script.

Now, to download course “XPTO”, just do:

coursera-dl "XPTO" -ca "hundreds of chars go here"
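
Since the CAUTH value is long and changes over time, one may prefer not to type it. The following is only a sketch of one possible approach, in Python: it assembles the same command shown above, reading the cookie value from an environment variable. The variable name COURSERA_CAUTH and the function build_command are my own hypothetical choices, not something COURSERA-DL defines; only the “-ca” option comes from the commands in this post.

```python
import os
import shlex


def build_command(course: str, cauth: str) -> list:
    """Assemble the coursera-dl call used in this post, as an argument list."""
    if not cauth:
        raise ValueError("an empty CAUTH value would not authenticate")
    return ["coursera-dl", course, "-ca", cauth]


if __name__ == "__main__":
    # COURSERA_CAUTH is a hypothetical variable name chosen for this sketch.
    cauth_value = os.environ.get("COURSERA_CAUTH", "")
    if cauth_value:
        # Print the command instead of running it; to run it, pass the
        # list to subprocess.run(...) on a system where coursera-dl exists.
        print(shlex.join(build_command("XPTO", cauth_value)))
    else:
        print("Set COURSERA_CAUTH to the value of the CAUTH cookie first.")
```

Keeping the command as a list avoids quoting problems with the long cookie string.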

The outdoor sky/clouds have joined my plants stream

I decided to add a 5th camera to the live stream of my plants (not) growing. This new camera captures the outdoor sky/clouds, and serves as a natural reference to what time of day it is, since I do not overlay any date or time indication on the sources. As I write, it is dark outside – not the best timing :).

For now, the stream is available on Twitch: https://www.twitch.tv/arturmarques_dot_com.

In the past, instead of a live stream, my option was to build time-lapse videos. To assist in the process, I coded solutions that build automatic time-lapse videos from image datasets, with configurable quality. When using these tools, I usually build 24-hour videos, but I could request the output of a larger or shorter time span – for example, I have enough material to construct months-long files. The key reason why I have not been doing so is that I have moved much of the raw data to the cloud, which is not as instantaneously readable as local physical volumes. When I started playing with these media and doing these easy, fun observations, one key reward was being able to promptly unveil whatever had happened in the past x hours.

I will adapt my solutions to the new cloud storage and automate the process again. Until then, the live stream should be available with some regularity.

URLs "p1" – 89 resources

I am an avid WWW surfer, with hundreds of websites visited each month, sometimes daily. I bookmark them all, at least for logging purposes. Posts in the "urls" category capture what was on my browser on a specific date. I hope you enjoy some of these shared resources.


Listening to Kelpe – "Ex-Aquarium"

Forget about conventional “power music”. This is it! Contained, yet systematically “growing”, not in beats-per-unit-of-time, nor in a linear fashion, but, overall, in stage size and/or “crafting” of a particular audio ambiance; effectively embracing, even invading, the listener’s attention, sometimes releasing after peaks.
Highly captivating music, combining instruments, as simple as single chords and basic drum plates, with laboriously thought, felt, loved!, musical environments.

Congratulations to “Kelpe”, Kel McKeown! This particular “Ex-Aquarium” (2008) album I am listening to, is a wonderful and intelligent creation. I am glad I found it – pay special attention to track #2, “Whirlwound”.

youtube-dl – an absurd, sad situation

What follows is my briefest introduction to “copyright”, as I limitedly understand it, followed by my personal thoughts on yesterday’s RIAA-initiated DMCA takedown of the project “youtube-dl” from github.com.

The full request for the takedown of “youtube-dl”, and many of its forks, is at
https://github.com/github/dmca/blob/master/2020/10/2020-10-23-RIAA.md

Intro
“Copyright”, from an economic perspective, is a set of monopolies given to creators, to incentivize “creation”. The rationale for these incentives is that creation is hard and failure-prone, while copying is relatively trivial.
However, if these monopolies were excessive, for example lasting “forever”, then creators, their heirs, or to whom the rights/monopolies were sold, would constitute a permanent bottleneck between the creation and the opportunity for society, as a whole, to benefit from it, with unrestricted freedom. Thus, the monopolies are time-limited – they have an expiration date.

There are also exceptions in the law. In the USA, “fair use” is the chapter to read, to understand exceptions. For example, showing a copyrighted video to a class of students, in an educational context, will likely be valid.
In Europe, exceptions are similar – international treaties signed by most countries have “harmonized” national copyright systems – but include explicitly enunciated use-cases, namely some learning acts at public libraries, that bypass the chance of litigation and do not require a judge to interpret particular situations as “fair use” or not.

The creator alone has the exclusive rights to decide who/what can be done with the creation; if/how it can be modified, and if/how it can be distributed. Societies, in the so-called “Knowledge Economy” we are living in, will mostly progress fueled by better knowledge, so creators are the professionals that modern societies need and “Copyright” law must keep evolving to keep the proper balance between the creators’ rewards and the societal benefits.

The “DMCA” (Digital Millennium Copyright Act) is one of the many changes that Copyright law incorporated, in the USA. However, it is an ugly one: until the year 2000, clean reverse engineering practices would probably be legit; since then, if they serve to bypass certain TPMs (Technological Protection Measures), whose defeat can “compromise” the creator, they might not be.

My thoughts
Clean reverse engineering practices usually are extraordinary innovations and should not be barred. The perverse effect of making certain TPM-defeating processes illegal, even when identified cleanly, with absolutely no access to the source intellectual property, is that the knowledge of the available bypasses will rest in the hands of the very few who do manage to find them. The chance for improvement is lost and asymmetries intensify, with solutions only available to a few, definitely not available to the entire community, leaving most under the false belief that the current fruition model is the single possible one. This has fueled “bug bounty” programs, thus contributing to alternative reward systems.

These are very hard topics to discuss lightly, and this post sure is light. But, right now, I find it very negative, wrong on many levels (economic and intellectual), damaging for all in the long run, and intensely disrespectful to the thinkers, writers and coders involved, that the RIAA is attacking years of hard-labored source code developed by a community of intellectuals.

The “youtube-dl” source code has probably done nothing more than promote the exact same artists that, allegedly, are being hurt by it. This is truly unfair. Have common sense! Some of the referenced artists themselves should take a good look in the mirror and try to assess if these tools are taking food off their tables – what they are indirectly doing is taking the pleasure of creation out of the lives of innocents, who just enjoy creating software. Have some decency. Live and love, and let live and love.

I also tweeted about this:
https://twitter.com/my_dot_com/

Prost’s 1988 McLaren F1 @Algarve = 01:27:594

This is a 01:27:594 lap around the Algarve track, racing Alain Prost’s 1988 F1 McLaren. As I write, the lap record is only 11 seconds faster, in a 2020 F1 car.

From 2020-10-23 to 2020-10-25, the Formula 1 Championship is at a new racing circuit, where F1 cars have never raced before: the “Autódromo Internacional do Algarve”, in Portimão, Algarve, Portugal. F1 never officially raced, but did test there, in the past.

In a season so competitively poor and lacking dispute for the wins, the interest is beyond the podium. Tracks like Algarve’s are a very welcome addition to the calendar, not just because they are new, but mainly because they are different: in this case, the layout brings variance in the Z-axis. Cars go up and down, frequently! Corners are blinder and wider than usual, allowing and even inviting alternative trajectories, enabling a human-factor not so evident in other locations. I am enjoying it! It is unique and F1 needs variables that can contribute to less predictable race results.

I decided to try it myself, racing Alain Prost’s 1988 McLaren F1.
I have also upped my simulator’s resolution, from 2560×1600 to 3440×1440. The wider ratio is more immersive. I changed for productivity reasons, not expecting gaming benefits, but they are there.

Here is a video of a 01:27:594 (minutes:seconds:milliseconds) lap, using rFactor 2. Contrary to many, I never found the sound of this car’s Honda engine particularly enjoyable or spectacular. In-car, the noise is too regular, providing relatively poor acoustic cues for when to shift gears, up or down. Modern F1 cars literally beep the drivers when it is time to up-shift. This car also had no speed limiter and no driver-assists, and that is good.
I find the McLaren heavy, high down-force, trustworthy. That is its key positive attribute: it is predictable – after a short time, you know how it will behave, except when on the limit on old tires, when it becomes less clear how the tire wear will condition outcomes.
I dislike the slow gearbox and there is nothing the driver can do to compensate for it: the setup only allows different gear ratios.

Regarding the track itself, it is ever-changing in altitude, and challenging to the left-front tire under braking, because there are two right-corners which require heavy braking while not in a straight line.

The video has two segments: the first ~90 seconds are captured from in-car, exactly as seen, when playing. The second half is footage from the “TV” camera. Enjoy!

My plants, (not) growing, LIVE! On Twitch

For a long time, I have been having fun with timelapse videos of my plants growing. Some of those videos have reached
this blog @ https://arturmarques.com/wp ;
and
my main youtube channel @ https://www.youtube.com/channel/UCUa0DzKskGo0iRYP8QzWsvA/

Today, I decided to combine the streams from four of the cameras that take the raw snapshots for the timelapse videos, into a single video feed. I am putting it live on Twitch:

@ https://www.twitch.tv/arturmarques_dot_com

It is live from my place, but it is all fully automatic. I am not monitoring the Twitch room or its feedback, if any. It is an experiment, to check how reliable the stream can be, and what kind of interactions it can ignite in self-motivated spectators.

Four of the vases – the ones on the upper-right corner of the video stream – have been trying to grow Chili seeds since yesterday. I read that it is hard to do it with temperatures below 30 °C. The temperature in the room where the plants are is below that, so there is a high chance of failure. It is very probable that nothing will happen for many days.

The plants are on artificial light for ~8 hours/day. The rest of the time, they live with whatever ambient light the room gets, which can be very irregular. When the room gets dark enough, the plants will remain visible with infrared lighting.
In an effort to have something on camera, due to the difficulty in growing Chili, I have also planted Dill and Mint.

Enjoy, if you can.