hdx.freshness.app.datafreshness

Determine freshness for all datasets in HDX

DataFreshness Objects

class DataFreshness()

[view_source]

Data freshness main class

Arguments:

  • session sqlalchemy.orm.Session - Session to use for queries
  • testsession Optional[sqlalchemy.orm.Session] - Session for test data or None
  • datasets Optional[List[Dataset]] - List of datasets or read from HDX if None
  • now datetime - Date to use or take current time if None
  • do_touch bool - Whether to touch HDX resources whose hash has changed

__init__

def __init__(session: Session,
             testsession: Optional[Session] = None,
             datasets: Optional[List[Dataset]] = None,
             now: Optional[datetime] = None,
             do_touch: bool = False) -> None

[view_source]

no_resources_force_hash

def no_resources_force_hash() -> Optional[int]

[view_source]

Get number of resources to force hash

Returns:

  • Optional[int] - Number of resources to force hash or None

spread_datasets

def spread_datasets() -> None

[view_source]

Try to arrange the list of datasets so that downloads don't keep hitting the same server, by spacing apart datasets from the same organisation

Returns:

None
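The spreading can be pictured as a round-robin interleave across per-organisation groups. This is a hedged sketch, not the library's actual implementation; the `spread_by_organisation` function and the `"organization"` key are illustrative assumptions:

```python
from collections import OrderedDict, deque

def spread_by_organisation(datasets):
    # Group datasets by organisation, preserving encounter order
    groups = OrderedDict()
    for dataset in datasets:
        groups.setdefault(dataset["organization"], deque()).append(dataset)
    # Round-robin across the groups so same-organisation datasets are spaced apart
    spread, queues = [], deque(groups.values())
    while queues:
        queue = queues.popleft()
        spread.append(queue.popleft())
        if queue:
            queues.append(queue)
    return spread
```

With input `[org_a, org_a, org_b]` this yields `[org_a, org_b, org_a]`, so consecutive downloads hit different servers where possible.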

add_new_run

def add_new_run() -> None

[view_source]

Add a new run number with corresponding date

Returns:

None

prefix_what_updated

@staticmethod
def prefix_what_updated(dbresource: DBResource, prefix: str) -> None

[view_source]

Prefix the what_updated field of resource

Arguments:

  • dbresource DBResource - DBResource object to change
  • prefix str - Prefix to prepend

Returns:

None

process_resources

def process_resources(
    dataset_id: str,
    previous_dbdataset: DBDataset,
    resources: List[Resource],
    updated_by_script: Optional[datetime],
    hash_ids: Optional[List[str]] = None
) -> Tuple[List[Tuple], Optional[str], Optional[datetime]]

[view_source]

Process an HDX dataset's resources. If a resource has not been checked for 30 days and we are still below the threshold for resource checking, it is flagged to be hashed even if its dataset is fresh.

Arguments:

  • dataset_id str - Dataset id
  • previous_dbdataset DBDataset - DBDataset object from previous run
  • resources List[Resource] - HDX resources to process
  • updated_by_script Optional[datetime] - Time script updated or None
  • hash_ids Optional[List[str]] - Resource ids to hash for testing purposes

Returns:

Tuple[List[Tuple], Optional[str], Optional[datetime]]: (resources to download, id of last resource updated, time updated)
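The 30-day recheck rule can be sketched as follows; the function name, parameter names and the threshold mechanism are assumptions for illustration only:

```python
from datetime import datetime, timedelta

RECHECK_AFTER = timedelta(days=30)

def flag_for_hashing(last_checked: datetime, forced_so_far: int,
                     force_threshold: int, now: datetime) -> bool:
    # Flag a resource to be hashed if it hasn't been checked in 30 days
    # and we remain below this run's forced-hash threshold
    return now - last_checked >= RECHECK_AFTER and forced_so_far < force_threshold
```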

process_datasets

def process_datasets(
    hash_ids: Optional[List[str]] = None
) -> Tuple[Dict[str, str], List[Tuple]]

[view_source]

Process HDX datasets. Extract the necessary metadata and store it in the freshness database. Calculate an initial freshness from that metadata: last modified (which can change due to filestore resource changes), review date (set when someone clicks the reviewed button in the UI) and updated by script (scripts provide the date of update in HDX metadata). For datasets that are not initially fresh, or which have resources that have not been checked in the last 30 days (up to the threshold on the number of resources to check), the resources are flagged to be downloaded and hashed.

Arguments:

  • hash_ids Optional[List[str]] - Resource ids to hash for testing purposes

Returns:

Tuple[Dict[str, str], List[Tuple]]: (datasets to check, resources to check)
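The initial freshness date amounts to the most recent of the three metadata dates described above. A simplified sketch, with an assumed function name (the library's actual field handling is more involved):

```python
from datetime import datetime
from typing import Optional

def initial_latest_of_modifieds(
    last_modified: Optional[datetime],
    review_date: Optional[datetime],
    updated_by_script: Optional[datetime],
) -> datetime:
    # Take the most recent of whichever metadata dates are present
    candidates = [d for d in (last_modified, review_date, updated_by_script) if d]
    return max(candidates)
```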

check_urls

def check_urls(
    resources_to_check: List[Tuple],
    user_agent: str,
    results: Optional[Dict] = None,
    hash_results: Optional[Dict] = None
) -> Tuple[Dict[str, Tuple], Dict[str, Tuple]]

[view_source]

Download resources and hash them. If the hash has changed compared to the previous run, download and hash again. Return two dictionaries, the first with the hashes from the first downloads and the second with the hashes from the second downloads.

Arguments:

  • resources_to_check List[Tuple] - List of resources to be checked
  • user_agent str - User agent string to use when downloading
  • results Optional[Dict] - Test results to use in place of first downloads
  • hash_results Optional[Dict] - Test results replacing second downloads

Returns:

Tuple[Dict[str, Tuple], Dict[str, Tuple]]: (results of first download, results of second download)

process_results

def process_results(
    results: Dict[str, Tuple],
    hash_results: Dict[str, Tuple],
    resourcecls: Union[Resource,
                       Any] = Resource) -> Dict[str, Dict[str, Tuple]]

[view_source]

Process the downloaded and hashed resources. If the two hashes are the same but differ from the previous run's, the file has changed. If the two hashes differ from each other, the resource is an API (e.g. an editable Google sheet) whose hash constantly changes. If the file is determined to have changed, the resource on HDX is touched to update its last_modified field. Return a dictionary of dictionaries mapping dataset id to resource ids to update information about resources, including their latest_of_modifieds.

Arguments:

  • results Dict[str, Tuple] - Test results to use in place of first downloads
  • hash_results Dict[str, Tuple] - Test results replacing second downloads
  • resourcecls Union[Resource, Any] - Class to use. Defaults to Resource.

Returns:

Dict[str, Dict[str, Tuple]]: Dataset id to resource id to resource info
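The hash comparison described above reduces to three cases. A sketch with assumed names, following the docstring's logic:

```python
def classify_hashes(previous_hash: str, first_hash: str, second_hash: str) -> str:
    if first_hash == second_hash:
        # Two identical downloads that differ from the previous run: a real change
        return "changed" if first_hash != previous_hash else "unchanged"
    # Hash differs between the two downloads: a dynamic endpoint such as an
    # editable Google sheet, so the changing hash is not treated as an update
    return "api"
```

Only the `"changed"` case leads to the resource being touched on HDX.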

update_dataset_latest_of_modifieds

def update_dataset_latest_of_modifieds(
        datasets_to_check: Dict[str, str],
        datasets_resourcesinfo: Dict[str, Dict[str, Tuple]]) -> None

[view_source]

Given the dictionary of dictionaries mapping dataset id to resource ids to resource update information (including latest_of_modifieds), work out the latest_of_modifieds for each dataset and calculate its freshness.

Arguments:

  • datasets_to_check Dict[str, str] - Datasets with resources that were hashed
  • datasets_resourcesinfo Dict[str, Dict[str, Tuple]] - Dataset id to resource id to resource info

Returns:

None

output_counts

def output_counts() -> str

[view_source]

Create and display output string

Returns:

  • str - Output string

set_latest_of_modifieds

@staticmethod
def set_latest_of_modifieds(dbobject: Union[DBDataset, DBResource],
                            modified_date: datetime,
                            what_updated: str) -> Tuple[str, bool]

[view_source]

Set the database object's latest_of_modifieds if the provided date is greater than the current one, and add the cause to the object's what_updated field.

Arguments:

  • dbobject Union[DBDataset, DBResource] - Database object to update
  • modified_date datetime - New modified date
  • what_updated str - Cause of update, e.g. hash

Returns:

Tuple[str, bool]: (DB object's what_updated, whether new date > current)
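In outline, the update looks like this. The sketch uses a stand-in object with just the two fields the docs mention, and assumes a comma-separated what_updated string; the real method operates on DBDataset/DBResource rows:

```python
from datetime import datetime

class FakeDBObject:
    # Stand-in for DBDataset/DBResource with only the fields used here
    def __init__(self, latest_of_modifieds: datetime, what_updated: str):
        self.latest_of_modifieds = latest_of_modifieds
        self.what_updated = what_updated

def set_latest_of_modifieds(dbobject, modified_date, what_updated):
    updated = modified_date > dbobject.latest_of_modifieds
    if updated:
        dbobject.latest_of_modifieds = modified_date
        # Append the cause, dropping the "nothing" placeholder (assumed separator)
        causes = [c for c in dbobject.what_updated.split(",") if c and c != "nothing"]
        if what_updated not in causes:
            causes.append(what_updated)
        dbobject.what_updated = ",".join(causes)
    return dbobject.what_updated, updated
```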

add_what_updated

@staticmethod
def add_what_updated(prev_what_updated: str, what_updated: str) -> str

[view_source]

Add to what_updated string any new cause of update (such as hash). "nothing" is removed if anything else is added.

Arguments:

  • prev_what_updated str - Previous what_updated string
  • what_updated str - Additional what_updated string

Returns:

  • str - New what_updated string
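A sketch of the described behaviour, assuming causes are comma-separated (the separator is not stated in the docs):

```python
def add_what_updated(prev_what_updated: str, what_updated: str) -> str:
    # Collect existing causes, dropping the "nothing" placeholder
    causes = [c for c in prev_what_updated.split(",") if c and c != "nothing"]
    # Add the new cause unless it is "nothing" or already recorded
    if what_updated and what_updated != "nothing" and what_updated not in causes:
        causes.append(what_updated)
    return ",".join(causes) if causes else "nothing"
```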

calculate_freshness

def calculate_freshness(last_modified: datetime, update_frequency: int) -> int

[view_source]

Calculate freshness based on a last modified date and the expected update frequency. Returns 0 for fresh, 1 for due, 2 for overdue and 3 for delinquent.

Arguments:

  • last_modified datetime - Last modified date
  • update_frequency int - Expected update frequency

Returns:

  • int - 0 for fresh, 1 for due, 2 for overdue and 3 for delinquent
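For illustration, a simplified version with hypothetical cut-offs (due after one update period, overdue after two, delinquent after three); the library's real per-frequency thresholds differ:

```python
from datetime import datetime, timedelta

def calculate_freshness(last_modified: datetime, update_frequency_days: int,
                        now: datetime) -> int:
    # Compare the dataset's age against multiples of its expected update period
    age = now - last_modified
    period = timedelta(days=update_frequency_days)
    if age <= period:
        return 0  # fresh
    if age <= 2 * period:
        return 1  # due
    if age <= 3 * period:
        return 2  # overdue
    return 3      # delinquent
```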