Obtaining and Cleaning News Data From Factiva

Jean Dinco
3 min read · Sep 3, 2020


Factiva is an online news archive owned by Dow Jones & Company that provides access to articles from a wide range of sources worldwide. You need an account to use the platform. I am using my university's access, as this work is related to my PhD.

PREREQUISITES:

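  1. Python 3.7+
  2. pandas (pip install pandas)
  3. An HTML parser for pd.read_html, such as lxml (pip install lxml)
  4. os (standard library, no installation needed; import os)
  5. glob (standard library, no installation needed; import glob)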

OBJECTIVE:

The objective of this short post is to get newspaper data from Factiva and convert it into a pandas dataframe or a .CSV file (in case you want to use it in other data mining/analysis software).

STEPS:

Step 1. Download the dataset from Factiva

This is not a Factiva tutorial, so if you do not know how to get started with Factiva, you can check the Dow Jones YouTube channel. In my case, I used the Factiva Search Form (not the Free Text Search) and typed my keyword/s in the ‘All of these words’ search bar. I saved the results as HTML files in a local folder called Factiva. Assuming you have downloaded more than one HTML file, you can use the glob module to collect the paths of all the HTML files in the directory.

import glob
import pandas as pd
import os
files = glob.glob(r'C:\Users\jdee\Desktop\Factiva\*.htm', recursive = True)

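Take note that recursive=True only has an effect when the pattern contains **, which then matches any files and zero or more directories and subdirectories. The pattern above has no **, so it matches only the .htm files directly inside the Factiva folder.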
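To see what recursive=True changes, here is a minimal sketch that builds a throwaway folder tree and compares a flat pattern with a ** pattern. The folder and file names (batch2, a.htm, b.htm) and the helper find_htm are made up for illustration; they are not part of my Factiva setup.

```python
import glob
import os
import tempfile

def find_htm(root, recursive):
    """Return sorted .htm paths relative to root, top level only or recursively."""
    if recursive:
        pattern = os.path.join(root, '**', '*.htm')  # ** walks into subfolders
    else:
        pattern = os.path.join(root, '*.htm')        # top level only
    matches = glob.glob(pattern, recursive=recursive)
    return sorted(os.path.relpath(m, root) for m in matches)

# Hypothetical layout: one export at the top level, one inside a subfolder.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'batch2'))
for rel in ['a.htm', os.path.join('batch2', 'b.htm')]:
    open(os.path.join(root, rel), 'w').close()

top_only = find_htm(root, recursive=False)
everything = find_htm(root, recursive=True)
print(top_only)
print(everything)
```

If all your exports sit in one folder, the flat pattern is enough; the ** version is only useful when you have organised the downloads into subfolders.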

Step 2. Read the files using pd.read_html

Now, let’s make an empty list where we will store all our data (again: assuming that you have multiple HTML files).

empty_list = []
for file in files:
    # read_html returns a list of dataframes, one per table in the file
    data = pd.read_html(file, index_col=0)
    empty_list.extend(data)

After running the code, inspect empty_list and you will see that it is no longer empty!

The abbreviations in the index of each table are Factiva field codes, not random letters. For example, HD stands for headline, WC for word count, and so on. Notice that I used extend() rather than append(): pd.read_html returns a list of dataframes, so extend() adds each dataframe to empty_list individually, whereas append() would add the whole list as a single element and leave us with a list of lists. This is not what we want.
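The difference between extend() and append() is easier to see on a toy example. In this sketch, plain strings stand in for the dataframes that pd.read_html would return; the names tables_file_1 and tables_file_2 are placeholders.

```python
# Stand-ins for the lists that pd.read_html returns (one list per file).
tables_file_1 = ['table_a', 'table_b']
tables_file_2 = ['table_c']

extended = []
appended = []
for tables in (tables_file_1, tables_file_2):
    extended.extend(tables)  # adds each table individually
    appended.append(tables)  # adds the whole list as one element

print(extended)  # flat list of three tables
print(appended)  # list of two lists
```

The flat list is what pd.concat expects later on, which is why extend() is the right choice here.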

Step 3. Concatenate all files into one dataframe with transposed columns and rows

Let’s now take our no-longer-empty empty_list, keep only the tables that have an HD (headline) field using a list comprehension, then concatenate and transpose them.

frames = pd.concat([l for l in empty_list if 'HD' in l.index.values], axis=1).T

This gives you a dataframe with one row per article and one column per Factiva field code.
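Here is a minimal sketch of the filter, concatenate and transpose step on hand-made tables, so you can see the reshaping without a Factiva export. The helper make_table and the sample values are invented; the table without an HD row is an assumption standing in for anything in the export that is not an article.

```python
import pandas as pd

def make_table(fields):
    """Build a one-column table indexed by field code, like one parsed article."""
    return pd.DataFrame({1: list(fields.values())}, index=list(fields.keys()))

article_1 = make_table({'HD': 'First headline', 'WC': '120 words'})
article_2 = make_table({'HD': 'Second headline', 'WC': '340 words'})
not_an_article = make_table({'Text': 'keyword', 'Date': 'All dates'})

tables = [article_1, not_an_article, article_2]

# Keep only tables that have a headline row, then put them side by side.
articles = [t for t in tables if 'HD' in t.index.values]
wide = pd.concat(articles, axis=1)

# Transpose so each article becomes a row and each field code a column.
frames_demo = wide.T.reset_index(drop=True)
print(frames_demo)
```

The reset_index(drop=True) call is my addition for the demo: after the transpose every row carries the same leftover label, so replacing it with 0, 1, 2, ... makes rows easier to address.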

BONUS STEP:

Factiva provides several fields that I will not need in my data analysis, so I am dropping the unnecessary columns and renaming the remaining ones, because the abbreviations confuse me.

frames.drop(columns=['SC', 'CY', 'RE', 'PUB', 'NS',
                     'CR', 'IPD', 'IPC', 'CT',
                     'IN', 'VOL', 'RF', 'LA', 'CO'], inplace=True)
frames.rename(columns={'AN': 'Accession_Number', 'SE': 'Section',
                       'HD': 'Headline', 'WC': 'Word_Count',
                       'PD': 'Publication_Date', 'SN': 'Source_Name',
                       'ED': 'Edition', 'PG': 'Page',
                       'LP': 'Lead', 'TD': 'Body',
                       'ET': 'Estimated_Time', 'BY': 'Author_Name',
                       'ART': 'Captions(pics)', 'CLM': 'Column'}, inplace=True)
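One thing to watch: drop() raises a KeyError if any listed column is missing, which can happen if your export does not contain every field code. A minimal sketch of a more forgiving version, using the errors='ignore' option of DataFrame.drop on a made-up three-column frame:

```python
import pandas as pd

# Toy frame with only three of the Factiva field codes.
demo = pd.DataFrame({'HD': ['A headline'], 'WC': ['120 words'], 'SC': ['XYZ']})

unwanted = ['SC', 'CY', 'RE']  # CY and RE are absent from this toy frame
new_names = {'HD': 'Headline', 'WC': 'Word_Count'}

# errors='ignore' skips labels that do not exist instead of raising KeyError.
cleaned = demo.drop(columns=unwanted, errors='ignore').rename(columns=new_names)
print(list(cleaned.columns))
```

This version returns a new dataframe instead of using inplace=True, so the original stays untouched if you want to go back to it.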

Then, let’s sort the data by the publication date:

frames['Publication_Date'] = pd.to_datetime(frames['Publication_Date'])
frames.sort_values(by='Publication_Date', inplace=True)
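If some dates fail to parse, pd.to_datetime will raise an error by default. Below is a minimal sketch of a more defensive conversion on invented data. The date format (day, full month name, year) is an assumption on my part, so check it against your own export before relying on it.

```python
import pandas as pd

demo = pd.DataFrame({
    'Headline': ['B', 'A', 'C'],
    'Publication_Date': ['3 September 2020', '21 August 2020', 'unknown'],
})

# Assumed date format: day, full month name, year. Adjust to match your export.
# errors='coerce' turns anything that does not match into NaT instead of failing.
demo['Publication_Date'] = pd.to_datetime(
    demo['Publication_Date'], format='%d %B %Y', errors='coerce')

# Count the rows that could not be parsed so they are not lost silently.
n_unparsed = int(demo['Publication_Date'].isna().sum())

# Unparsed dates (NaT) are placed last by default when sorting.
demo = demo.sort_values(by='Publication_Date').reset_index(drop=True)
print(demo)
print(n_unparsed)
```

If n_unparsed is close to the number of rows, the assumed format is probably wrong for your data.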

You can now do your analysis here using other Python machine learning libraries, or you can save the dataframe as a .CSV file to import it into other data analysis software!

In case you don’t know, the code to save it as .CSV is:

frames.to_csv('filename.csv')
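By default, to_csv also writes the row index as the first column. Given how the dataframe was built (concatenated one-column tables, then transposed), I would expect that index to carry no useful information, so you may prefer to leave it out. A minimal sketch with a made-up frame and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'Headline': ['A headline'], 'Word_Count': ['120 words']},
                    index=[1])

# Hypothetical file name, written to a temporary folder for the demo.
path = os.path.join(tempfile.mkdtemp(), 'factiva_demo.csv')

# index=False leaves the row labels out of the file.
demo.to_csv(path, index=False, encoding='utf-8')

# Read the file back to confirm that only the real columns were saved.
reloaded = pd.read_csv(path)
print(list(reloaded.columns))
```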


Written by Jean Dinco

Jean is a PhD candidate working on data, media & conflict forecasting.
