JavaJotter - Data Harvesting and Storage
Welcome back to this multi-part series about a small project I’ve been working on called JavaJotter. If you haven’t already, go check out part 1 here. This post explores the first two components of the project: the data harvester (or scraper) and the database backend.
Data Harvester #
The data harvester was built on four key principles: maintainability, extensibility, rapid development, and ease of testing. The goal was to develop a tool that was not just efficient at data collection, but also simple to maintain and expand, quick to build, and straightforward to test.
We chose a suite of relatively standard tools to achieve this. The data harvester is written in C# 11, which provides a robust programming environment with cross-platform capabilities through .NET. It also grants access to high-quality libraries such as Autofac, SlackNet, and the .NET data providers.
In our system design, Dependency Injection (DI) is a crucial principle. For those not acquainted with the term, DI is a design approach in object-oriented programming where an object receives other objects it depends on. Instead of having an object compose its dependencies or request a factory object to generate one, the required dependencies are provided to the object from the outside, rendering it an external responsibility.
The importance of Dependency Injection lies in its power to eliminate hard-coded dependencies among objects, thereby making the system more modular and maintainable. This pattern carries several benefits:
- Testability: Testing objects becomes a lot easier when dependencies are provided externally. For instance, you can supply mock dependencies to track interactions or assert states.
- Code Reuse: DI aids in writing reusable code. Given that components are decoupled, they can be reused in different areas of the application.
- Boilerplate Code Reduction: The DI design pattern significantly reduces boilerplate code, as the initialization of dependencies is handled by the injector component.
- Improved Code Readability: By segregating initialization and factory code from the business logic, code becomes much cleaner and easier to interpret.
However, caution should be exercised when implementing DI, as overuse or incorrect use can make the code flow hard to follow. It’s also worth noting that learning and integrating a Dependency Injection framework adds an extra layer of complexity to the codebase. But fear not: the benefits far outweigh the initial learning curve.
The Autofac library plays a central role in our DI setup, acting as the IoC (Inversion of Control) container that manages object creation and dependency injection, further enhancing testability and maintainability.
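As a sketch of how this fits together (the type names below are illustrative, not the project's actual classes), a dependency arrives through the constructor and Autofac wires the object graph:

```csharp
using Autofac;

// A service abstraction and a concrete implementation.
public interface IRollParser
{
    bool IsRoll(string attachmentText);
}

public class RegexRollParser : IRollParser
{
    public bool IsRoll(string attachmentText) =>
        System.Text.RegularExpressions.Regex.IsMatch(
            attachmentText, @"<@[A-Za-z0-9]+>\s+rolled\s+\*\d+\*");
}

// The harvester never constructs its dependencies; they arrive
// through the constructor, so tests can pass in a mock parser.
public class Harvester
{
    private readonly IRollParser _parser;
    public Harvester(IRollParser parser) => _parser = parser;
    public bool Accept(string text) => _parser.IsRoll(text);
}

public static class CompositionRoot
{
    public static IContainer Build()
    {
        var builder = new ContainerBuilder();
        builder.RegisterType<RegexRollParser>().As<IRollParser>();
        builder.RegisterType<Harvester>();
        return builder.Build(); // Autofac resolves the graph for us.
    }
}
```

With this in place, `CompositionRoot.Build().Resolve<Harvester>()` hands back a fully wired harvester, and a unit test can instead construct `new Harvester(mockParser)` directly.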
SlackNet is a .NET library used for creating Slack apps with ease. It’s designed to simplify the process of building and deploying Slack apps, handling many of the intricacies involved in working with the Slack API. In our project, SlackNet is employed to seamlessly connect our data harvester to the Slack environment as a fully functional Slack app. It pulls all the messages from channels where our bot has been added, facilitating our data collection process.
Roll messages are distinct from regular messages on Slack. Firstly, when they are returned by the Slack API, they appear to have no sender (otherwise we could’ve just checked who sent the message). Secondly, roll messages have an empty string where the body text of a message is usually stored. Instead, the associated text for each roll message is stored as a text attachment; think of it as a text file attached to the message. Thirdly, each roll message attachment takes the form <@USERID> rolled *0*, where @USERID is a unique Slack user ID and 0 is any numerical value from 0 to 100.
The data collection process is straightforward. We request all messages from channels where the bot is added, starting from the time of the last recorded database entry up until now. Messages are filtered to identify valid rolls. The relevant data from each valid coffee roll (user tag, timestamp, and roll value) is then normalized, batched, and uploaded to the database.
To detect roll messages, we first check if the message body is empty (no text). If the message body is empty, we proceed to check if the message has a text attachment that matches the usual roll pattern. For ease of development, I’ve opted to use regular expressions for pattern matching to detect the roll messages and to extract pertinent information.
Regular expressions, also known as regex, are a powerful tool in the programmer’s toolkit. In essence, they are sequences of characters that form a search pattern. These patterns can then be used to match, locate, and manage text. Regex is supported by many programming languages, including C#, although they may differ per implementation. Regex provides an efficient way to handle text data and is heavily utilized in tasks such as input validation, data scraping, data wrangling, string parsing, and complex string replacement.
For this project, we used three primary regex patterns:
Pattern 1: Broad Format #
<@[A-Za-z0-9]+>\s+rolled\s+\*\d+\*
This pattern matches any string where a user tag is followed by the word “rolled”, followed by a number enclosed in asterisks. Here’s a breakdown:
- `<@[A-Za-z0-9]+>`: This portion matches a user tag. The ‘@’ symbol precedes user IDs in platforms like Slack. The square brackets ([]) contain a range that matches any upper- or lower-case letter or digit, and the ‘+’ means “one or more of the preceding element.”
- `\s+rolled\s+`: This portion looks for the word “rolled” surrounded by one or more whitespace characters (`\s+`).
- `\*\d+\*`: This portion matches a number enclosed in asterisks. `\d+` matches one or more digits, and the asterisks are escaped with a backslash (`\*`) because an unescaped asterisk is a special character in regex.
If we take a sample input string like “<@U0G9QF9C6> rolled *12*”, this regex pattern will match it.
Pattern 2: User Tags #
<@([A-Za-z0-9]+)>
This pattern matches any user tag, an alphanumeric string enclosed in <@ and >. The parentheses here are used to create a capture group, which means “whatever matches, remember it.” For instance, if we have a user tag like “<@U0G9QF9C6>”, this pattern will match it and remember “U0G9QF9C6”.
Pattern 3: Roll Value #
\*(\d+)\*
This pattern matches any number enclosed in asterisks. Similar to the second pattern, it uses parentheses to remember the number that matches. If we have “*12*” in our text, it will match and remember “12”.
These regex patterns play a crucial role in extracting structured data from the unstructured message text, enabling our data harvester to identify and filter the relevant information from the gathered messages.
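Putting the three patterns together, a minimal parser might look like the following sketch (the method and type names are illustrative, not the project's actual code):

```csharp
using System.Text.RegularExpressions;

public static class RollParser
{
    // Pattern 1: does the attachment text look like a roll at all?
    private static readonly Regex RollPattern =
        new(@"<@[A-Za-z0-9]+>\s+rolled\s+\*\d+\*");

    // Pattern 2: capture the Slack user ID.
    private static readonly Regex UserPattern = new(@"<@([A-Za-z0-9]+)>");

    // Pattern 3: capture the roll value between the asterisks.
    private static readonly Regex ValuePattern = new(@"\*(\d+)\*");

    public static (string UserId, int Value)? TryParse(string body, string attachmentText)
    {
        // Roll messages have an empty body; the roll lives in the attachment.
        if (!string.IsNullOrEmpty(body) || !RollPattern.IsMatch(attachmentText))
            return null;

        var userId = UserPattern.Match(attachmentText).Groups[1].Value;
        var value = int.Parse(ValuePattern.Match(attachmentText).Groups[1].Value);
        return (userId, value);
    }
}
```

Called on the sample message, `RollParser.TryParse("", "<@U0G9QF9C6> rolled *12*")` yields the pair `("U0G9QF9C6", 12)`, while any message with a non-empty body or a non-matching attachment yields `null`.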
WebSockets #
While somewhat removed from the topic of data harvesting, a really cool feature of the Slack API is its handling of Real Time Messaging (RTM) via WebSockets. A WebSocket provides a long-lived connection for exchanging messages between client and server where messages may flow in either direction for full-duplex communication.
In our project, the WebSocket connection comes into play as it enables real-time data exchange and reduces the number of handshakes we need to make when requesting large numbers of messages (improving our performance). When our harvester is running, it maintains an open WebSocket with Slack’s servers. This enables an immediate two-way exchange of information, such as continued polling of historic messages. Additionally, as soon as a message appears in a channel, our app is notified, which allows us to expand in the future to near real-time responsiveness, updating the data dashboard on a roll-by-roll basis.
However, the importance of WebSockets really shines when it comes to running apps behind firewalls. Traditional HTTP connections can be impeded by firewall restrictions, particularly when an inbound connection to a server is necessary. In contrast, WebSocket connections are initiated with an HTTP handshake, which is compatible with virtually all existing firewall rules that allow web traffic. Once this handshake is complete, the connection upgrades to the WebSocket protocol, retaining the original source and destination ports. The firewall continues to see this as an established HTTP connection, even though it’s now a full-duplex WebSocket connection. As a result, we can continue to receive real-time updates from Slack, even when behind a restrictive firewall.
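Wiring this up with SlackNet is brief. The sketch below follows SlackNet's documented Socket Mode setup; the tokens are placeholders, and the exact builder calls may differ between SlackNet versions:

```csharp
using SlackNet;

// Socket Mode needs both a bot token (xoxb-...) and an app-level
// token (xapp-...) with the connections:write scope.
var services = new SlackServiceBuilder()
    .UseApiToken("xoxb-your-bot-token")
    .UseAppLevelToken("xapp-your-app-level-token");

var client = services.GetSocketModeClient();

// Opens the outbound WebSocket to Slack; because the connection is
// initiated from our side, no inbound firewall rule is required.
await client.Connect();
```

From here, any event handlers registered on the builder receive messages as they arrive over the socket.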
The Database #
We chose PostgreSQL for our database management system. PostgreSQL is an open-source object-relational database system with over 30 years of active development. Its proven architecture has earned a solid reputation for reliability, data integrity, and accuracy. It is standards-compliant and programmable. While this project primarily takes advantage of its relational support (SQL), PostgreSQL also supports non-relational JSON querying.
PostgreSQL offers many unique features. It includes a rich set of native data types like JSON, XML, and arrays, and has full-text search capabilities. It also supports indexing, transactions, and subqueries. Moreover, PostgreSQL has numerous extensions and libraries that provide additional functionality, making it highly extensible and customizable. We’ve taken advantage of one such extension called PostgREST to create our RESTful API, which will be discussed in the next blog post. All things considered, its strong community, comprehensive documentation, and adherence to standards make it an excellent choice for complex, robust applications.
For our database, we’ve built out three primary tables: ‘usernames’, ‘channels’, and ‘rolls’.
The SQL query to create the ‘usernames’ table is:
```sql
CREATE TABLE IF NOT EXISTS usernames (
    id SERIAL PRIMARY KEY,
    slack_id TEXT UNIQUE NOT NULL,
    username TEXT
);
```
The ‘usernames’ table stores each lab user’s unique ‘slack_id’ and ‘username’. The ‘id’ column is the primary key, guaranteeing the uniqueness of each record.
The ‘channels’ table is created using the SQL query:
```sql
CREATE TABLE IF NOT EXISTS channels (
    id SERIAL PRIMARY KEY,
    slack_id TEXT UNIQUE NOT NULL,
    channel_name TEXT
);
```
The ‘channels’ table records the unique ‘slack_id’ and ‘channel_name’. Like the ‘usernames’ table, the ‘id’ column is the primary key.
The ‘rolls’ table is created with this query:
```sql
CREATE TABLE IF NOT EXISTS rolls (
    unix_milliseconds BIGINT NOT NULL,
    channel_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    dice_value INTEGER NOT NULL,
    FOREIGN KEY (user_id) REFERENCES usernames(id),
    FOREIGN KEY (channel_id) REFERENCES channels(id),
    PRIMARY KEY (unix_milliseconds, channel_id)
);
```
The ‘rolls’ table captures crucial details about each roll. It includes the timestamp (‘unix_milliseconds’), ‘channel_id’, ‘user_id’, and the crucial ‘dice_value’: the result of the roll that determines the coffee duties for the next day. We establish links to the ‘usernames’ and ‘channels’ tables using ‘user_id’ and ‘channel_id’ as foreign keys. Therefore, the ‘rolls’ table records the who, where, when, and result of each coffee roll in a coherent, linked narrative.
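To see how the foreign keys tie things together, here is a hypothetical insertion flow; the Slack IDs, names, and values are made up for illustration:

```sql
-- Upsert the user and channel first, keyed on their Slack IDs.
INSERT INTO usernames (slack_id, username)
VALUES ('U0G9QF9C6', 'alice')
ON CONFLICT (slack_id) DO NOTHING;

INSERT INTO channels (slack_id, channel_name)
VALUES ('C024BE91L', 'coffee-rolls')
ON CONFLICT (slack_id) DO NOTHING;

-- Then record the roll, resolving the surrogate ids from the Slack IDs.
INSERT INTO rolls (unix_milliseconds, channel_id, user_id, dice_value)
SELECT 1688000000000, c.id, u.id, 12
FROM usernames u, channels c
WHERE u.slack_id = 'U0G9QF9C6'
  AND c.slack_id = 'C024BE91L';
```

Because ‘rolls’ only stores the integer surrogate keys, a user renaming themselves on Slack requires updating a single row in ‘usernames’, not every roll they’ve ever made.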
The concept of normalization is a critical aspect of an efficient database design. Established by Edgar F. Codd, the father of the relational model, normalization is a process that helps eliminate redundant data, ensure data dependencies make sense, and reduce the complexities of data update operations.
In essence, normalization is a technique used to minimize data redundancy and prevent data anomalies by dividing the data into separate but related tables. Each table should be about a specific topic, reducing the amount of duplicate data stored in the database.
There are several levels of normalization, each called a “normal form.” The normal forms are hierarchical, meaning a database can’t be in the third normal form (3NF) without first being in the second and first normal forms.
Our database design adheres to the Third Normal Form (3NF). To understand 3NF, let’s briefly touch upon the first two forms:
First Normal Form (1NF) ensures that each column in a table holds a single, atomic value, eliminating repeating groups of data within a row.
Second Normal Form (2NF) is mainly a concern for tables with composite primary keys. It demands that each non-key column be functionally dependent on the whole primary key, not just part of it.
Third Normal Form (3NF), our design target, goes a step further. It insists that non-key columns must depend on the primary key and nothing but the key. This design principle eliminates transitive dependencies, where a non-key field depends on another non-key field.
By adhering to 3NF, we ensure that each non-key column is directly dependent on the primary key, and only on the primary key. The primary key uniquely identifies a record, so this level of normalization ensures we’re not storing unnecessary duplicate information. Additionally, by reducing redundancy, 3NF helps to conserve storage space and to keep the database tidy and efficient.
With these tables, it’s straightforward to add, delete, and modify records without accidental data loss. Furthermore, by reducing redundancy, the database becomes more flexible, simplifies the complexity of revising data, and provides a quicker response to data queries. This is because updates, deletions, or insertions will typically touch fewer tables, and therefore, fewer records.
Ultimately, normalization is an essential part of database design that helps to maintain data integrity and facilitate efficient data manipulation and retrieval.
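Normalization does mean that reading the data back requires joins; reconstructing the full “who, where, when, and result” view described above is a straightforward three-table join:

```sql
-- Rebuild the denormalized view of each roll from the three tables.
SELECT u.username,
       c.channel_name,
       r.unix_milliseconds,
       r.dice_value
FROM rolls r
JOIN usernames u ON r.user_id = u.id
JOIN channels c ON r.channel_id = c.id
ORDER BY r.unix_milliseconds DESC;
```

This query is exactly the shape of lookup the indices in the next section are designed to accelerate.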
The SQL queries also include the creation of indices on the ‘rolls’ table:
```sql
CREATE INDEX IF NOT EXISTS rolls_user_id_index ON rolls (user_id);
CREATE INDEX IF NOT EXISTS rolls_channel_id_index ON rolls (channel_id);
CREATE INDEX IF NOT EXISTS rolls_unix_milliseconds_index ON rolls (unix_milliseconds);
```
Indices, much like the index section in a large book, can significantly enhance the performance of our database queries. By constructing an index on a specific column, such as ‘user_id’, the database engine can rapidly navigate to the required data, similar to how you would quickly locate a topic in a book by referring to the index.
Without an index, the database would have to perform a “full table scan”, which is as laborious as it sounds. It would need to read every row in the table, one by one, to find the matching records—akin to reading a book cover to cover just to find a specific topic.
However, when an index is available, the database can take a more efficient route. It can use the index to narrow down the potential locations of the desired data, substantially reducing the number of rows it has to examine. The concept is not unlike flipping directly to the right chapter in a book, saving you the time of scanning through unrelated pages.
Additionally, if you want to locate rows that match a certain value, such as a specific ‘user_id’, an index can be incredibly valuable. Rather than having to wander aimlessly through the ‘forest’ of data rows, the database engine can use the index as its guide, reaching the desired ‘tree’ (or in our case, coffee drinker) with speed and precision.
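You can watch the planner make this choice with EXPLAIN; the user_id value here is arbitrary, and the exact plan depends on table size and statistics:

```sql
EXPLAIN SELECT * FROM rolls WHERE user_id = 42;
-- With rolls_user_id_index in place, PostgreSQL can report an
-- Index Scan or Bitmap Index Scan here instead of a Seq Scan.
```

On a small or freshly created table the planner may still prefer a sequential scan, since reading a handful of pages directly can be cheaper than consulting the index.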
Hopefully, this post gave more insight into how this first step works. If you’d like to check out the codebase, you can visit its GitHub Repository. Keep an eye out for the next post on how we consume and serve the data to our dashboard.