Dataset description

Before uploading data to EnergyDataDK, data owners must provide specific information about their data.

This information is crucial for ensuring smooth data usage by both users and data owners.
Most of the information, unless otherwise specified, is visible to all users.

This guide consists of two parts:

  1. Setting up a dataset
  2. Setting up a datastream

Setting up a dataset

A dataset is essentially a collection of interrelated datastreams, so not much information is needed to set one up. The dataset must have a name, so it can be identified; an MQTT topic prefix, to identify the dataset to a data broker; and a description, to specify details about the data contained within the dataset. Lastly, a picture can be added to make the dataset visually easier to identify in the dataset overview.

TL;DR

Datasets must have a unique name and MQTT topic prefix. The latter is essentially the first level of a hierarchy.
In this example of an MQTT topic: denmark/hovedstaden/lyngby, the prefix would be denmark.
The prefix must consist of only hyphens, underscores, and alphanumeric characters.

Datasets must also have a proper description that informs about the source, period, irregularities, usage, and a point of contact (POC).

Optionally, an image can be uploaded to graphically represent the dataset.

Dataset name

The name of the dataset is the primary means by which EnergyDataDK users identify what data the dataset contains. The name should be intuitive for the data owner as well as for internal and external users. We suggest including a project, company, or lab name, as well as the type of information.

Here are some examples:
  • Project name/Wind data
  • Master thesis/X data
  • Company name/Project name

MQTT topic prefix

MQTT topics are a fundamental part of how the MQTT protocol routes messages between publishers and subscribers. They act as “addresses” that define where each message should be delivered. MQTT topics are hierarchical, with levels separated by slashes (/), so you may consider the prefix to be the first level of the hierarchy.
For example, if this were our MQTT topic: usa/california/san-francisco/silicon-valley, then usa would be our MQTT topic prefix.

A topic prefix is a single string of alphanumeric characters, underscores, and hyphens. Furthermore, since MQTT topics are case sensitive, it’s recommended to only use lowercase letters. The topic prefix is only visible to dataset owners. You can read more about its usage in the API description.

Important: Only alphanumeric characters, hyphens, and underscores are permitted. Because the prefix is a single topic level, it cannot contain slashes, and spaces can't be used to separate words.
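The prefix rules can be checked before a dataset is created. The following is a minimal sketch in Python, not part of EnergyDataDK itself: the function names are hypothetical, and it assumes that “alphanumeric” means ASCII letters and digits and that a prefix is a single topic level (so a slash is rejected).

```python
import re

# Assumption: one topic level made of ASCII letters, digits,
# underscores, and hyphens. No slashes, no spaces, not empty.
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix uses only the permitted characters."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


def is_recommended_prefix(prefix: str) -> bool:
    """Return True if the prefix is valid and has no uppercase letters."""
    return is_valid_prefix(prefix) and prefix == prefix.lower()


print(is_valid_prefix("denmark"))             # True
print(is_valid_prefix("wind data"))           # False: contains a space
print(is_recommended_prefix("Wind_Data-01"))  # False: uppercase letters
```

The second function reflects the lowercase recommendation only; an uppercase prefix is still accepted by the first check.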

Description

To ensure smooth data usage by both users and data owners, it is crucial to provide comprehensive information about the dataset. This should include the following:
  • General description of the data (type, source, etc.)
  • Data granularity in the dataset
  • Period covered by the dataset
  • Known irregularities
  • Usage restrictions
  • Contact person

The dataset description can be edited by data owners after the dataset has been created. It should be updated as soon as the above information becomes available, and whenever anything changes.

Example of a dataset description

The dataset consists of synthetic data generated for demonstration purposes. It contains randomly generated records representing different data types commonly used in structured datasets.
Dataset structure:
  • Alphanumeric Data: 2 independent datastreams containing randomly generated text strings.
  • Integer Data: 3 independent datastreams with randomly generated numbers.
  • Boolean Values: 1 datastream representing True/False values.
The dataset includes 100 records spanning from April 1st, 2023, to April 5th, 2023. There are missing values between 14:00 and 18:00 on the 4th of April due to the server maintenance carried out at that time. Data is recorded at hourly intervals. To use the dataset, the user must sign an NDA. For further details regarding the dataset and the NDA, please contact: example@email.com

Picture

Adding a picture is an optional feature, but it makes it easier to identify datasets.
If you have many datasets, avoid using the same picture for all of these, as it would defeat the purpose.

The picture should be intuitive both for the data owner and users with access to the data.

Setting up a datastream

A datastream is essentially a channel that receives data from a sensor, measurement device, or similar source.
Each observation on the channel is a tuple consisting of a timestamp, indicating when the observation occurred, and the value that was measured.
All timestamps in EnergyDataDK are in UTC time.
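To illustrate the (timestamp, value) structure and the UTC requirement, here is a minimal sketch in Python. The Observation type and make_observation helper are hypothetical names chosen for this example and do not describe the EnergyDataDK API; the sketch only shows how a locally recorded time could be converted to UTC before it is sent.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Union


class Observation(NamedTuple):
    """One observation: when it occurred (UTC) and what was measured."""
    timestamp: datetime
    value: Union[int, float, str]


def make_observation(measured_at: datetime, value) -> Observation:
    """Build an observation, converting the timestamp to UTC."""
    if measured_at.tzinfo is None:
        # A naive timestamp is ambiguous, so it is rejected here.
        raise ValueError("timestamp must be timezone-aware")
    return Observation(measured_at.astimezone(timezone.utc), value)


# Example: a reading taken at 14:00 in a UTC+2 time zone.
utc_plus_2 = timezone(timedelta(hours=2))
obs = make_observation(datetime(2023, 4, 4, 14, 0, tzinfo=utc_plus_2), 21.5)
print(obs.timestamp.isoformat())  # 2023-04-04T12:00:00+00:00
```

Rejecting timestamps without time zone information is a design choice of this sketch; it avoids silently interpreting local time as UTC.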

Each datastream is assigned a name, an MQTT topic suffix, and a data type, and is described by a number of mandatory tags (metadata) that qualify the data.

TL;DR

Datastreams must have a unique name and MQTT topic suffix. The latter is essentially the part of the MQTT topic beyond the prefix (first level) of the hierarchical structure. So in this example: denmark/hovedstaden/lyngby, the suffix would be hovedstaden/lyngby.

The suffix must only consist of hyphens, underscores, slashes and alphanumeric characters.

The type of data (integer, double, or string) in the datastream must be declared.

There is a fixed number of mandatory fields that must be filled out, and you can additionally add a virtually unlimited number of extra metadata fields.

Datastream name

As with the dataset, it’s important to choose a name that makes it easy for any user to understand what data is recorded in the stream.

MQTT topic suffix

As described above, the MQTT topic is a fundamental part of how messages are routed. The combination of the MQTT topic prefix and suffix is used to identify a datastream, so the suffix must be unique. The topic suffix is visible only to the dataset owner(s) and users with “read” permission to the dataset. An MQTT topic suffix consists solely of alphanumeric strings (which may include hyphens and underscores) separated by “/”, which signify the levels in the topic hierarchy. For example, if this were our MQTT topic: usa/california/san-francisco/silicon-valley, then california/san-francisco/silicon-valley would be our MQTT topic suffix.
Important: Only alphanumeric characters, hyphens, underscores, and slashes are permitted. Spaces can't be used to separate words.
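The relationship between a full topic, its prefix, and its suffix can be sketched in Python as follows. The helper names are hypothetical, and the pattern rests on assumptions: ASCII letters and digits only, levels separated by single slashes, no empty levels, and a suffix of at least one level.

```python
import re

# Assumption: each level uses ASCII letters, digits, underscores, or
# hyphens; levels are joined by single slashes with no empty levels.
SUFFIX_PATTERN = re.compile(r"[A-Za-z0-9_-]+(/[A-Za-z0-9_-]+)*")


def split_topic(topic: str) -> tuple[str, str]:
    """Split a full MQTT topic into (prefix, suffix) at the first slash."""
    prefix, separator, suffix = topic.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"topic needs a prefix and a suffix: {topic!r}")
    return prefix, suffix


def is_valid_suffix(suffix: str) -> bool:
    """Return True if the suffix uses only the permitted characters."""
    return SUFFIX_PATTERN.fullmatch(suffix) is not None


prefix, suffix = split_topic("usa/california/san-francisco/silicon-valley")
print(prefix)                   # usa
print(suffix)                   # california/san-francisco/silicon-valley
print(is_valid_suffix(suffix))  # True
```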

Datatype

You must specify the datatype of the datastream. This can be one of the following:
  • Integer
    Whole numbers without decimals.
  • Double
    Numbers with decimals. Please note that you must use a period, and not a comma, as your decimal separator!
  • String
    Words or even complete sentences, including numbers and special characters.
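The three datatypes and the decimal-separator rule can be illustrated with a small Python sketch that checks raw values before they are sent. The parse_value helper and the lowercase datatype labels are assumptions made for this example, not part of EnergyDataDK.

```python
def parse_value(raw: str, datatype: str):
    """Convert raw text to the declared datatype (hypothetical helper)."""
    if datatype == "integer":
        # int() rejects text with a decimal part, such as "3.5".
        return int(raw)
    if datatype == "double":
        if "," in raw:
            raise ValueError("use a period, not a comma, as decimal separator")
        return float(raw)
    if datatype == "string":
        return raw
    raise ValueError(f"unknown datatype: {datatype!r}")


print(parse_value("42", "integer"))     # 42
print(parse_value("3.14", "double"))    # 3.14
print(parse_value("valve A", "string")) # valve A
```

A value such as "3,14" declared as a double raises a ValueError in this sketch, which makes the separator mistake visible before any data is uploaded.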

Properties

Each datastream has a number of mandatory fields which qualify the data contained therein.
You can also add a virtually unlimited number of custom fields.

Comment

Here you should enter more detailed information about the datastream that isn’t already clear from its name.

Data license

Here are some CC licenses, which describe the terms of use. They are listed from most to least permissive below.

GDPR classification

While GDPR does not mandate a specific classification policy, it requires organizations to categorize and protect data appropriately based on sensitivity and risk. There are several categories of data.
  • Personal Data
    Any information relating to an identified or identifiable natural person, such as names, location data, phone numbers, and online identifiers.
  • Sensitive Data
    Data revealing racial/ethnic origin, political opinions, religious beliefs, trade union membership, genetic data, biometric data, or health data.
  • Pseudonymized Data
    Personal data processed so it can no longer be attributed to a specific person without additional information.
  • Anonymized Data
    Data rendered anonymous so the data subject can’t be identified.

Geo tag

The geographical coordinates of where the data is collected. You can enter just a region or address in the text field; the system will offer matches to your query with their corresponding geo-location coordinates.

Location

The name of the installation where the data collection takes place.

Organization

The name of the organization responsible for the data collection.

Project tag

The name of the project the data is being collected for.

Theme tag

This categorizes the datastream by its subject. Since this will likely be similar to the search term used to find a particular datastream, it should be clear and concise.

Here are a few examples: “Solar energy”, “CO2 emission”, “district heating”, etc.

Unit

The unit of measurement for the data in the datastream. The system will suggest an option based on your input.

Custom headers

You have the option of adding a virtually unlimited number of extra metadata fields to your datastream, besides the mandatory ones. This could be anything you or other users may find relevant. 

Keep in mind that the naming must be very clear and intuitive, since these will be nonstandard fields. You may also want to consider adding some documentation about these metadata fields in the description of the dataset.
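As a rough illustration of how mandatory and custom fields relate, the following Python sketch checks a metadata dictionary before a datastream is set up. It assumes that the mandatory fields are the properties listed in this section; the key names and helper functions are hypothetical and do not reflect how EnergyDataDK stores metadata.

```python
# Assumption: the mandatory fields are the properties described above.
# The key names are invented for this example.
MANDATORY_FIELDS = [
    "comment", "data_license", "gdpr_classification", "geo_tag",
    "location", "organization", "project_tag", "theme_tag", "unit",
]


def missing_mandatory_fields(metadata: dict) -> list:
    """Return the mandatory fields that are absent or blank."""
    return [
        field for field in MANDATORY_FIELDS
        if not str(metadata.get(field, "")).strip()
    ]


def custom_fields(metadata: dict) -> dict:
    """Return the extra, nonstandard fields the owner has added."""
    return {k: v for k, v in metadata.items() if k not in MANDATORY_FIELDS}


metadata = {
    "comment": "Outdoor temperature, roof sensor",
    "unit": "°C",
    "sensor_model": "example-model",  # custom field with a clear name
}
print(missing_mandatory_fields(metadata))
print(custom_fields(metadata))  # {'sensor_model': 'example-model'}
```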