This is a small Python project that generates a minimalistic news site from Reddit posts.
Check it out on Github: https://github.com/johnyrose/reddit-news
A Static hosted demo can be found here.
And yes, this was made before Gen AI made it super easy
Background
The project uses 3 external APIs to achieve the generated news site:
- Reddit API (Using the Praw library) to get posts from Reddit
- URLMeta To extract display images from the links in the posts
- ExtractorAPI To extract the article text from the links in the posts
The used website template is based on News-journal by Andrea Dall’Ara.
The site generation flow generally goes through the following stages:
- Data collection:
- Get news posts from Reddit (from the configured subreddits)
- Generate a usable article from each post by extracting the article text and image from the post’s link
- Site generation:
- Go through the generated articles and choose which articles will appear on the site
- Generate a static site from the chosen articles with a Jinja template.
Quickstart
This section will guide you through the quickest way to get the site up and running. A more detailed explanation of the project’s configuration can be found in the Configuration section.
Prerequisites:
Before you run the project, you’ll need to generate API keys for the following services:
Each of these services has a free tier that should be enough for this project, none of them require any payment information, and getting the keys for all 3 should take no longer than 10 minutes.
From each API, make sure you have the following data:
- Reddit API:
- Username
- Password
- Client ID
- Client Secret
- URLMeta:
- API Key
- ExtractorAPI:
- API Key
Also, Python 3.8+ needs to be installed on your machine.
Setup
- Clone this repository and enter the project’s directory
git clone https://github.com/johnyrose/reddit-news cd reddit-news
- Install the project’s dependencies
pip install -r requirements.txt
- Create a file named
config.json
in the project’s root directory. This file will contain the configuration for the project. Copy the following content to the file:{ "tasks_to_perform": { "data_collection": true, "website_generation": true }, "external_apis": { "reddit": { "user": "<Your Reddit username>", "password": "<Your Reddit password>", "clientId": "<Your Reddit Client ID>", "clientSecret": "<Your Reddit Client Secret>", "userAgent": "bla" }, "extractor": { "url": "https://extractorapi.com/api/v1/extractor", "api_token": "<Your ExtractorAPI API Key>" }, "urlmeta": { "url": "https://api.urlmeta.org", "email": "<Your Email>", "api_token": "<Your URLMeta API Key>" } }, "general_settings": { "db_file": "sqlite.db", "enable_sqlalchemy_logging": false, "template_file": "site_generator/template.jinja2", "output_file": "output.html", "local_images_folder": "images", "use_local_images": true, "max_parallel_requests": 1 }, "subreddits": [ { "name": "worldnews", "posts": 15, "ignoreStickied": true }, { "name": "technews", "posts": 10, "ignoreStickied": true }, { "name": "UpliftingNews", "posts": 6, "ignoreStickied": true } ], "website": { "text_length": { "main_article": 500, "sub_articles": 200, "mini_articles": 100, "news_rows_articles": 150, "titles": 170 }, "articles_amount": { "sub_articles": 4, "mini_articles": 5, "news_row_articles": 3 }, "sorting_method": "score" } }
- Replace the values in the
external_apis
section with the API keys you generated earlier - Run the project
python main.py
- The generated site will be saved in the
output.html
file in the project’s root directory. Double click the file to open it in your browser. - You should now see the generated site in your browser.
- Replace the values in the
Configuration
This section will elaborate on the options available in the project’s configuration file. The config template can be found in the Quickstart section.
tasks_to_perform:
This section contains a boolean value for each of the project’s tasks. Setting a task’s value to true
will cause the task to be performed when the project is run. Setting a task’s value to false
will cause the task to be skipped when the project is run.
This can be useful when you already ran the data collection phase and want to play around with the config when generating the site.
- data_collection: Whether to perform the data collection task. If set to
false
, the project will skip the data collection task and use the existing data from the database instead. - website_generation: Whether to perform the website generation task.
subreddits:
This section contains a list of subreddits to get posts from. Each subreddit is represented by a dictionary with the following fields:
- name: The name of the subreddit to get posts from. This is the name that appears in the subreddit’s URL. For example, the name of the r/worldnews subreddit is
worldnews
. - posts: The amount of posts to get from the subreddit.
- ignoreStickied: Whether to ignore stickied posts.
I highly suggest trying different subreddits and seeing different results.
general_settings:
These are general settings for the project. For the most part, I’d suggest leaving them as they are in the example above.
- db_file: The path to the SQLITE database file.
- enable_sqlalchemy_logging: Whether to enable SQLAlchemy logging.
- template_file: The path to the template file to use when generating the site.
- output_file: The path to the output file to save the generated site to.
- use_local_images: Whether to download the images locally when generating the site. Setting this to
false
will use the original image URLs instead. This can cause some of the images to be broken after a while. - local_images_folder: The path to the folder to save the images to (only relevant if
use_local_images
is set totrue
). - max_parallel_requests: The maximum amount of parallel requests to make to the external APIs. This can be useful if you want to limit the amount of requests you make to the external APIs.
website:
- text_length: The maximum length of the text to display in the generated site for each type of article.
- articles_amount: The amount of articles to display in the generated site for each type of article.
- Note: Setting the
news_row
setting to a value higher than 3 will cause the site to look weird.
- Note: Setting the
- sorting_method: The sorting method to use when sorting the articles in the generated site. Can be either
score
orrelative_upvotes
.- score: Sorts the articles by their score (upvotes).
- relative_upvotes: Sorts the articles by their relative upvotes. This is calculated by dividing the article’s upvotes by the amount of upvotes the article with the most upvotes has.
- This method is a weird attempt to balance the article order between the different subreddit. It doesn’t always work well, and I’d suggest trying it only if you’re unsatisfied with the results of the
score
method.
- This method is a weird attempt to balance the article order between the different subreddit. It doesn’t always work well, and I’d suggest trying it only if you’re unsatisfied with the results of the
Thank you for checking out the project and I hope you enjoy using it! Feel free to contact me for any suggestions, feedback of questions.