Simplify News

Objective

The project comprises two main segments: Summarizing News Articles and News4Kids (visualizing news).

Accomplishments

  • Summarizing News Articles:
    • Data Collection: The newspaper Python library was used to obtain articles from BBC and The Straits Times.
    • Data Preprocessing: Irrelevant HTML tags were removed, and stopwords were filtered out to further refine the dataset, since they add little meaning to the content.
    • Summarization: Implemented two transformer models, BERT and GPT-2, to generate a summary of each article.
    • Model Evaluation: Each model was evaluated with the ROUGE score, a metric that measures the n-gram overlap between the generated summary and a reference summary.
  • News4Kids:
    • Data Collection: Articles were scraped from Reddit using PRAW, a Python library.
    • Data Preprocessing: To avoid content that is too grotesque or gloomy, we performed sentiment analysis on the extracted content and filtered out unwanted items.
    • Sentiment Analysis: Implemented sentiment analysis with both NLTK and TextBlob and compared them to choose the more suitable one for our use case.
    • Named Entity Recognition: Implemented named entity recognition to further refine the text passed to the Stable Diffusion model, which generates comic-style images based on the content.
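
The sentiment-based filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy word lists, and the -0.2 threshold are all assumptions made for the example. The scorer is passed in as a function so that a real sentiment library can be swapped in.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical toy lexicons, for illustration only (not from the project).
POSITIVE = {"happy", "good", "great", "fun", "win"}
NEGATIVE = {"terrible", "killed", "crash", "sad", "war"}


def toy_polarity(text: str) -> float:
    """Return a crude polarity in [-1, 1] from word counts (stand-in scorer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(pos + neg, 1)


def filter_by_sentiment(
    texts: Iterable[str],
    polarity_fn: Callable[[str], float],
    min_polarity: float = -0.2,  # assumed threshold; tune for the use case
) -> List[str]:
    """Keep only the texts whose polarity is at or above min_polarity."""
    return [t for t in texts if polarity_fn(t) >= min_polarity]


posts = [
    "A happy dog found a new home",
    "A terrible crash killed many",
    "The council met on Tuesday",
]
kept = filter_by_sentiment(posts, toy_polarity)
print(kept)
```

With TextBlob installed, polarity_fn could instead be `lambda t: TextBlob(t).sentiment.polarity`, which returns a value between -1.0 and 1.0.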

Results

TextBlob achieved higher accuracy than NLTK, so it was used in the final model. BERT achieved the higher ROUGE score of the two summarization models, 0.99.
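
For readers unfamiliar with ROUGE, the sketch below shows how a ROUGE-N score can be computed from n-gram overlap. It is a simplified illustration (lowercased word tokens, no stemming) and is not necessarily the scoring implementation or ROUGE variant used in this project; the function name rouge_n and the example sentences are assumptions.

```python
import re
from collections import Counter
from typing import Dict


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Compute simplified ROUGE-N precision, recall, and F1 for two texts."""

    def ngrams(text: str) -> Counter:
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

In this example five of the six unigrams overlap, so precision, recall, and F1 are each 5/6. A score of 1.0 would mean the generated and reference summaries share every n-gram.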