hdx.freshness.app.datafreshness
Determine freshness for all datasets in HDX
DataFreshness Objects
class DataFreshness()
Data freshness main class
Arguments:
session
sqlalchemy.orm.Session - Session to use for queries
testsession
Optional[sqlalchemy.orm.Session] - Session for test data or None
datasets
Optional[List[Dataset]] - List of datasets or read from HDX if None
now
datetime - Date to use or take current time if None
do_touch
bool - Whether to touch HDX resources whose hash has changed
__init__
def __init__(session: Session,
testsession: Optional[Session] = None,
datasets: Optional[List[Dataset]] = None,
now: datetime = None,
do_touch: bool = False) -> None
no_resources_force_hash
def no_resources_force_hash() -> Optional[int]
Get number of resources to force hash
Returns:
Optional[int] - Number of resources to force hash or None
spread_datasets
def spread_datasets() -> None
Try to arrange the list of datasets so that downloads do not keep hitting the same server, by moving apart datasets from the same organisation.
Returns:
None
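One way to picture the spreading step is a round-robin interleave by organisation. The following is a minimal sketch under that assumption, not the method's actual algorithm; the function `spread_by_group` and the `(organisation, dataset)` tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List, Tuple


def spread_by_group(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Interleave (organisation, dataset) pairs round-robin by organisation.

    Hypothetical helper for illustration; not part of DataFreshness.
    """
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for item in items:
        groups[item[0]].append(item)
    sentinel = object()
    spread: List[Tuple[str, str]] = []
    # Take one item from each organisation in turn until all are used up.
    for round_ in zip_longest(*groups.values(), fillvalue=sentinel):
        spread.extend(item for item in round_ if item is not sentinel)
    return spread


datasets = [
    ("org-a", "d1"),
    ("org-a", "d2"),
    ("org-a", "d3"),
    ("org-b", "d4"),
    ("org-c", "d5"),
]
result = spread_by_group(datasets)
```

When one organisation dominates the list, its datasets still end up adjacent at the tail; an interleave can only spread them as far as the other organisations allow.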
add_new_run
def add_new_run() -> None
Add a new run number with corresponding date
Returns:
None
prefix_what_updated
@staticmethod
def prefix_what_updated(dbresource: DBResource, prefix: str) -> None
Prefix the what_updated field of resource
Arguments:
dbresource
DBResource - DBResource object to change
prefix
str - Prefix to prepend
Returns:
None
process_resources
def process_resources(
dataset_id: str,
previous_dbdataset: DBDataset,
resources: List[Resource],
updated_by_script: Optional[datetime],
hash_ids: List[str] = None
) -> Tuple[List[Tuple], Optional[str], Optional[datetime]]
Process HDX dataset's resources. If the resource has not been checked for 30 days and we are below the threshold for resource checking, then the resource is flagged to be hashed even if the dataset is fresh.
Arguments:
dataset_id
str - Dataset id
previous_dbdataset
DBDataset - DBDataset object from previous run
resources
List[Resource] - HDX resources to process
updated_by_script
Optional[datetime] - Time script updated or None
hash_ids
Optional[List[str]] - Resource ids to hash for testing purposes
Returns:
Tuple[List[Tuple], Optional[str], Optional[datetime]]: (resources to download, id of last resource updated, time updated)
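The 30-day rule can be sketched as a small predicate. This is an illustration only: the helper `should_force_hash`, its parameters, and the choice to treat exactly 30 days as "not checked for 30 days" are assumptions, not the method's real internals.

```python
from datetime import datetime, timedelta

# The 30-day window comes from the description above.
FORCE_HASH_AFTER = timedelta(days=30)


def should_force_hash(
    last_checked: datetime,
    now: datetime,
    forced_so_far: int,
    threshold: int,
) -> bool:
    """Hypothetical helper: hash a resource even though its dataset is fresh?"""
    if forced_so_far >= threshold:
        # The budget for forced checks in this run is already used up.
        return False
    # Assumption: a check exactly 30 days ago counts as stale.
    return now - last_checked >= FORCE_HASH_AFTER


now = datetime(2024, 1, 31)
stale = should_force_hash(datetime(2024, 1, 1), now, forced_so_far=0, threshold=5)
recent = should_force_hash(datetime(2024, 1, 15), now, forced_so_far=0, threshold=5)
over_budget = should_force_hash(datetime(2023, 1, 1), now, forced_so_far=5, threshold=5)
```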
process_datasets
def process_datasets(
hash_ids: Optional[List[str]] = None
) -> Tuple[Dict[str, str], List[Tuple]]
Process HDX datasets. Extract the necessary metadata and store it in the freshness database. Calculate an initial freshness based on the metadata: last modified (which can change due to filestore resource changes), review date (set when someone clicks the reviewed button in the UI) and updated by script (scripts provide the date of update in HDX metadata). For datasets that are not initially fresh or which have resources that have not been checked in the last 30 days (up to the threshold for the number of resources to check), the resources are flagged to be downloaded and hashed.
Arguments:
hash_ids
Optional[List[str]] - Resource ids to hash for testing purposes
Returns:
Tuple[Dict[str, str], List[Tuple]]: (datasets to check, resources to check)
check_urls
def check_urls(
resources_to_check: List[Tuple],
user_agent: str,
results: Optional[Dict] = None,
hash_results: Optional[Dict] = None
) -> Tuple[Dict[str, Tuple], Dict[str, Tuple]]
Download resources and hash them. If the hash has changed compared to the previous run, download and hash again. Return two dictionaries, the first with the hashes from the first downloads and the second with the hashes from the second downloads.
Arguments:
resources_to_check
List[Tuple] - List of resources to be checked
user_agent
str - User agent string to use when downloading
results
Optional[Dict] - Test results to use in place of first downloads
hash_results
Optional[Dict] - Test results replacing second downloads
Returns:
Tuple[Dict[str, Tuple], Dict[str, Tuple]]: (results of first download, results of second download)
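The download-then-recheck flow for a single resource might look like the sketch below. It is a simplified illustration, assuming SHA-256 as the hash and an injected `fetch` callable in place of real HTTP downloads; `hash_twice_if_changed` and `fake_fetch` are hypothetical names, and the real method works on batches of resources rather than one URL at a time.

```python
import hashlib
from typing import Callable, List, Optional, Tuple


def hash_twice_if_changed(
    url: str,
    previous_hash: Optional[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (first hash, second hash or None)."""
    first_hash = hashlib.sha256(fetch(url)).hexdigest()
    if first_hash == previous_hash:
        # Unchanged since the previous run, so no second download is needed.
        return first_hash, None
    # The hash differs from the previous run: download and hash again.
    second_hash = hashlib.sha256(fetch(url)).hexdigest()
    return first_hash, second_hash


calls: List[str] = []


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real download so the sketch runs offline."""
    calls.append(url)
    return b"same bytes every time"


stable_hash = hashlib.sha256(b"same bytes every time").hexdigest()
unchanged = hash_twice_if_changed("https://example.org/a.csv", stable_hash, fake_fetch)
changed = hash_twice_if_changed("https://example.org/a.csv", "hash-from-previous-run", fake_fetch)
```

Injecting the fetch function plays a similar role to the `results` and `hash_results` arguments: tests can supply known outcomes without touching the network.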
process_results
def process_results(
results: Dict[str, Tuple],
hash_results: Dict[str, Tuple],
resourcecls: Union[Resource, Any] = Resource
) -> Dict[str, Dict[str, Tuple]]
Process the downloaded and hashed resources. If the two hashes are the same but different to the previous run's, the file has been changed. If the two hashes are different, it is an API (e.g. an editable Google sheet) where the hash constantly changes. If the file is determined to have been changed, then the resource on HDX is touched to update its last_modified field. Return a dictionary of dictionaries mapping dataset id to resource id to update information about each resource, including its latest_of_modifieds.
Arguments:
results
Dict[str, Tuple] - Test results to use in place of first downloads
hash_results
Dict[str, Tuple] - Test results replacing second downloads
resourcecls
Union[Resource, Any] - Class to use. Defaults to Resource.
Returns:
Dict[str, Dict[str, Tuple]]: Dataset id to resource id to resource info
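The hash comparison rules described for this method can be condensed into a small classifier. This is a sketch of the decision logic only; `classify_resource` and its string labels are hypothetical and do not appear in the module.

```python
from typing import Optional


def classify_resource(
    previous_hash: Optional[str],
    first_hash: str,
    second_hash: Optional[str],
) -> str:
    """Hypothetical classifier following the rules in the description."""
    if first_hash == previous_hash:
        # Same hash as the previous run: nothing changed.
        return "unchanged"
    if second_hash is None:
        raise ValueError("a second hash is needed once the first hash differs")
    if second_hash == first_hash:
        # Two identical downloads that differ from the previous run:
        # the file itself has been changed.
        return "changed"
    # Two consecutive downloads disagree: treat as an API whose
    # content (and therefore hash) changes constantly.
    return "api"


same = classify_resource("a", "a", None)
edited = classify_resource("a", "b", "b")
volatile = classify_resource("a", "b", "c")
```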
update_dataset_latest_of_modifieds
def update_dataset_latest_of_modifieds(
datasets_to_check: Dict[str, str],
datasets_resourcesinfo: Dict[str, Dict[str, Tuple]]) -> None
Given the dictionary of dictionaries mapping dataset id to resource id to update information about each resource (including its latest_of_modifieds), work out latest_of_modifieds for datasets and calculate freshness.
Arguments:
datasets_to_check
Dict[str, str] - Datasets with resources that were hashed
datasets_resourcesinfo
Dict[str, Dict[str, Tuple]] - Dataset id to resource id to resource info
Returns:
None
output_counts
def output_counts() -> str
Create and display output string
Returns:
str - Output string
set_latest_of_modifieds
@staticmethod
def set_latest_of_modifieds(dbobject: Union[DBDataset, DBResource],
modified_date: datetime,
what_updated: str) -> Tuple[str, bool]
Set latest of modifieds if provided date is greater than current and add to the Database object's what_updated field.
Arguments:
dbobject
Union[DBDataset, DBResource] - Database object to update
modified_date
datetime - New modified date
what_updated
str - What updated, e.g. hash
Returns:
Tuple[str, bool]: (DB object's what_updated, whether new date > current)
add_what_updated
@staticmethod
def add_what_updated(prev_what_updated: str, what_updated: str)
Add to what_updated string any new cause of update (such as hash). "nothing" is removed if anything else is added.
Arguments:
prev_what_updated
str - Previous what_updated string
what_updated
str - Additional what_updated string
Returns:
str - New what_updated string
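A rough sketch of how such a combination could behave is shown below. The comma separator, the duplicate check and the function name `combine_what_updated` are assumptions made for illustration; only the handling of "nothing" is taken from the description.

```python
def combine_what_updated(prev_what_updated: str, what_updated: str, sep: str = ",") -> str:
    """Hypothetical re-implementation for illustration only."""
    if what_updated in prev_what_updated.split(sep):
        # Assumption: a cause that is already recorded is not added twice.
        return prev_what_updated
    if prev_what_updated in ("", "nothing"):
        # "nothing" is dropped as soon as a real cause is added.
        return what_updated
    if what_updated == "nothing":
        # Adding "nothing" to real causes leaves them untouched.
        return prev_what_updated
    return f"{prev_what_updated}{sep}{what_updated}"


replaced = combine_what_updated("nothing", "hash")
appended = combine_what_updated("metadata", "hash")
deduplicated = combine_what_updated("metadata,hash", "hash")
ignored = combine_what_updated("hash", "nothing")
```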
calculate_freshness
def calculate_freshness(last_modified: datetime, update_frequency: int) -> int
Calculate freshness based on a last modified date and the expected update frequency. Returns 0 for fresh, 1 for due, 2 for overdue and 3 for delinquent.
Arguments:
last_modified
datetime - Last modified date
update_frequency
int - Expected update frequency
Returns:
int - 0 for fresh, 1 for due, 2 for overdue and 3 for delinquent
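To make the four return values concrete, here is a minimal sketch of a status calculation. The thresholds are invented for illustration (due at one update period, overdue at two, delinquent at three), the update frequency is assumed to be a positive number of days, and `now` is passed explicitly; the real method's thresholds are not given in this reference and may differ.

```python
from datetime import datetime, timedelta

FRESH, DUE, OVERDUE, DELINQUENT = 0, 1, 2, 3


def freshness_status(last_modified: datetime, update_frequency: int, now: datetime) -> int:
    """Hypothetical status calculation with made-up thresholds."""
    if update_frequency <= 0:
        # Assumption: only positive frequencies (in days) are handled here.
        raise ValueError("update_frequency must be a positive number of days")
    age = now - last_modified
    period = timedelta(days=update_frequency)
    if age >= 3 * period:
        return DELINQUENT
    if age >= 2 * period:
        return OVERDUE
    if age >= period:
        return DUE
    return FRESH


now = datetime(2024, 1, 31)
statuses = [
    freshness_status(datetime(2024, 1, 30), 7, now),  # 1 day old
    freshness_status(datetime(2024, 1, 24), 7, now),  # 7 days old
    freshness_status(datetime(2024, 1, 17), 7, now),  # 14 days old
    freshness_status(datetime(2024, 1, 1), 7, now),   # 30 days old
]
```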