As of this writing, Ghana ranks as the twenty-seventh most polluted country in the world, facing significant challenges due to air pollution. Recognizing the crucial role of air quality monitoring, many African countries, including Ghana, are adopting low-cost air quality sensors.
The Sensor Evaluation and Training Centre for West Africa (Afri-SET) aims to use technology to address these challenges. Afri-SET engages with air quality sensor manufacturers, providing crucial evaluations tailored to the African context. Through sensor evaluations and informed decision-making support, Afri-SET empowers governments and civil society to manage air quality effectively.
On December 6–8, 2023, the non-profit organization Tech to the Rescue, in collaboration with AWS, organized the world’s largest Air Quality Hackathon, aimed at tackling one of the world’s most pressing health and environmental challenges: air pollution. More than 170 tech teams used the latest cloud, machine learning, and artificial intelligence technologies to build 33 solutions. The solution described in this blog post addresses Afri-SET’s challenge and was ranked among the top 3 winning solutions.
This post presents a solution that uses generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the data integration problem posed by these sensors. The solution harnesses the capabilities of generative AI, specifically large language models (LLMs), to address the challenges posed by diverse sensor data and to automatically generate Python functions based on various data formats. The fundamental objective is to build a manufacturer-agnostic database, leveraging generative AI’s ability to standardize sensor outputs, synchronize data, and facilitate precise corrections.
Current challenges
Afri-SET currently merges data from numerous sources, employing a bespoke approach for each sensor manufacturer. This manual synchronization process, hindered by disparate data formats, is resource-intensive and limits the potential for widespread data orchestration. The platform, although functional, deals with CSV and JSON files containing hundreds of thousands of rows from various manufacturers, demanding substantial effort for data ingestion.
The objective is to automate data integration from various sensor manufacturers for Accra, Ghana, paving the way for scalability across West Africa. Despite the challenges, Afri-SET, with limited resources, envisions a comprehensive data management solution for stakeholders seeking sensor hosting on their platform, aiming to deliver accurate data from low-cost sensors. The effort is hampered by the current focus on data cleaning, which diverts valuable expertise away from building ML models for sensor calibration. Additionally, they aim to report corrected data from low-cost sensors, which requires information beyond specific pollutants.
The solution had the following requirements:
- Cloud hosting – The solution must reside in the cloud, ensuring scalability and accessibility.
- Automated data ingestion – An automated system is essential for recognizing and synchronizing new (unseen), diverse data formats with minimal human intervention.
- Format flexibility – The solution should accommodate both CSV and JSON inputs and be flexible about the formatting (any reasonable column names, units of measure, any nested structure, or malformed CSV such as missing or extra columns).
- Golden copy preservation – Retaining an untouched copy of the data is essential for reference and validation purposes.
- Cost-effectiveness – The solution should only invoke the LLM to generate reusable code on an as-needed basis, instead of manipulating the data directly, to be as cost-effective as possible.
The goal was to build a one-click solution that takes different data structures and formats (CSV and JSON) and automatically converts them to be integrated into a database with unified headers, as shown in the following figure. This allows data to be aggregated for further manufacturer-agnostic analysis.

Figure 1: Convert data with different data formats into a desired data format with unified headers
Overview of solution
The proposed solution uses Anthropic’s Claude 2.1 foundation model through Amazon Bedrock to generate Python code, which converts input data into a unified data format. LLMs excel at writing code and reasoning over text, but tend not to perform as well when interacting directly with time-series data. In this solution, we leverage the reasoning and coding abilities of LLMs to create reusable Extract, Transform, Load (ETL) code, which transforms sensor data files that don’t conform to a universal standard so they can be stored together for downstream calibration and analysis. Additionally, we use the reasoning capabilities of LLMs to understand what the labels mean in the context of an air quality sensor, such as particulate matter (PM), relative humidity, temperature, and so on.
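As an illustration, the following is a minimal sketch of what such a code-generation call against Claude 2.1 on Amazon Bedrock could look like; the prompt wording, sample payload, and printed handling are assumptions for illustration, not the exact prompts used in the solution.
```python
import json
import boto3

# Minimal sketch: ask Claude 2.1 on Amazon Bedrock to write a conversion
# function for an unseen sensor payload. Prompt text and sample payload are
# illustrative assumptions only.
bedrock = boto3.client("bedrock-runtime")

sample_payload = '{"device": "sensor-01", "readings": [{"type": "PM2.5", "value": 12.4}]}'

prompt = (
    "\n\nHuman: Write a Python function `to_dataframe(payload: dict)` that converts "
    "the following JSON sensor payload into a flat Pandas DataFrame with one column "
    "per measurement. Return only the code.\n"
    f"Example payload:\n{sample_payload}"
    "\n\nAssistant:"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2:1",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 1000, "temperature": 0}),
)

generated_code = json.loads(response["body"].read())["completion"]
print(generated_code)  # reviewed and stored for reuse on files of the same format
```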
The following diagram shows the conceptual architecture:

Figure 2: The AWS reference architecture and the workflow for data transformation with Amazon Bedrock
Solution walkthrough
The solution reads raw data files (CSV and JSON) from Amazon Simple Storage Service (Amazon S3) (Step 1) and checks whether it has seen the device type (or data format) before. If so, the solution retrieves and runs the previously generated Python code (Step 2), and the transformed data is stored in S3 (Step 10). The solution only invokes the LLM for a new device data file type (for which code has not yet been generated). This is done to optimize performance and minimize the cost of LLM invocation. If Python code is not available for a given device data file, the solution notifies the operator to check the new data format (Steps 3 and 4). Currently, the operator checks the new data format and validates whether it comes from a new manufacturer (Step 5). Next, the solution checks whether the file is CSV or JSON. If it is a CSV file, the data can be converted directly to a Pandas data frame by a Python function without LLM invocation. If it is a JSON file, the LLM is invoked to generate a Python function that creates a Pandas data frame from the JSON payload, taking into account its schema and how nested it is (Step 6).
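The gating logic that decides whether to reuse stored code or escalate to the operator and the LLM can be sketched roughly as follows; the bucket name, key layout, and function name are assumptions for illustration.
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
CODE_BUCKET = "afri-set-generated-code"  # assumed bucket for stored transform functions


def get_cached_transform(device_type: str):
    """Return previously generated Python code for this device type, if any (Step 2)."""
    try:
        obj = s3.get_object(Bucket=CODE_BUCKET, Key=f"transforms/{device_type}.py")
        return obj["Body"].read().decode("utf-8")
    except ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchKey":
            # Unseen format: notify the operator and, if needed, invoke the LLM (Steps 3-6)
            return None
        raise
```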
We invoke the LLM to generate Python functions that manipulate the data with three different prompts (input strings):
- The first invocation (Step 6) generates a Python function that converts a JSON file to a Pandas data frame. JSON files from manufacturers have different schemas. Some input data uses a pair of value type and value for each measurement. That format results in data frames containing one column of value types and one column of values, and such columns need to be pivoted.
- The second invocation (Step 7) determines whether the data needs to be pivoted and generates a Python function for pivoting if needed. Another issue with the input data is that the same air quality measurement can have different names from different manufacturers; for example, “P1” and “PM1” refer to the same type of measurement (see the sketch following this list).
- The third invocation (Step 8) focuses on data cleaning. It generates a Python function to convert data frames to a common data format. The Python function may include steps for unifying column names for the same type of measurement and dropping columns.
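The functions produced by the second and third prompts typically amount to a pivot plus a column-name mapping, along the lines of this hypothetical sketch (the column names and alias table are illustrative, not Afri-SET’s actual mappings):
```python
import pandas as pd

# Hypothetical long-format frame, as produced from a (value type, value) JSON payload
long_df = pd.DataFrame({
    "timestamp": ["2023-12-06T00:00Z", "2023-12-06T00:00Z"],
    "value_type": ["P1", "humidity"],
    "value": [7.9, 64.0],
})

# Step 7 (sketch): pivot so each measurement becomes its own column
wide_df = long_df.pivot_table(index="timestamp", columns="value_type", values="value").reset_index()

# Step 8 (sketch): unify manufacturer-specific names, e.g. "P1" -> "PM1"
ALIASES = {"P1": "PM1", "humidity": "relative_humidity"}  # illustrative mapping
wide_df = wide_df.rename(columns=ALIASES)
print(wide_df)
```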
All LLM-generated Python code is stored in the repository (Step 9) so that it can be reused to process daily raw device data files and transform them into a common format.
The data is then stored in Amazon S3 (Step 10) and can be published to OpenAQ so other organizations can use the calibrated air quality data.
The following screenshot shows the proposed frontend, for illustrative purposes only, as the solution is designed to integrate with Afri-SET’s existing backend system.
Results
The proposed method minimizes LLM invocations, thus optimizing cost and resources. The solution only invokes the LLM when a new data format is detected. The generated code is stored, so that input data with a format seen before can reuse the code for data processing.
A human-in-the-loop mechanism safeguards data ingestion. This happens only when a new data format is detected, to avoid overburdening scarce Afri-SET resources. Having a human in the loop to validate each data transformation step is optional.
Automatic code generation reduces data engineering work from months to days. Afri-SET can use this solution to automatically generate Python code based on the format of the input data. The output data is transformed to a standardized format and stored in a single location in Amazon S3 in Parquet, a columnar and efficient storage format. If useful, the solution can be extended into a data lake platform that uses AWS Glue (a serverless data integration service for data preparation) and Amazon Athena (a serverless, interactive analytics service) to analyze and visualize data. With AWS Glue custom connectors, it is straightforward to transfer data between Amazon S3 and other applications. Additionally, this is a no-code experience for Afri-SET’s software engineers, allowing them to effortlessly build their data pipelines.
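For example, writing a standardized frame to the shared S3 location could look like the following sketch; the bucket and prefix are placeholders, and pandas needs the s3fs and pyarrow packages installed to write Parquet directly to S3.
```python
import pandas as pd

# Standardized output from the generated transforms (illustrative content only)
wide_df = pd.DataFrame({"timestamp": ["2023-12-06T00:00Z"], "PM1": [7.9]})

# Write Parquet straight to S3; requires s3fs and pyarrow, bucket name is a placeholder
wide_df.to_parquet("s3://afri-set-standardized/accra/2023-12-06.parquet", index=False)
```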
Conclusion
This solution allows for easy data integration to help expand cost-effective air quality monitoring. It supports data-driven and informed regulation, fostering community empowerment and encouraging innovation.
This initiative, aimed at gathering accurate data, is a significant step toward a cleaner and healthier environment. We believe that AWS technology can help address poor air quality through technical solutions like the one described here. If you want to prototype similar solutions, apply to the AWS Health Equity initiative.
As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.
About the authors
Sandra Topic is an Environmental Equity Leader at AWS. In this role, she leverages her engineering background to find new ways to use technology to solve the world’s “To Do list” and drive positive social impact. Sandra’s journey includes social entrepreneurship and leading sustainability and AI efforts in tech companies.
Qiong (Jo) Zhang, PhD, is a Senior Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI. She holds 30+ patents and has co-authored 100+ journal and conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.
Gabriel Verreault is a Senior Partner Solutions Architect at AWS for the Industrial Manufacturing segment. Gabriel works with AWS partners to define, build, and evangelize solutions around Smart Manufacturing, Sustainability, and AI/ML. Gabriel also has expertise in industrial data platforms, predictive maintenance, and combining AI/ML with industrial workloads.
Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.