Abstract
JSON is a widely used data format in Web and enterprise applications, and many data analytics systems now support loading and querying JSON data. However, JSON parsing is costly and can dominate the execution time of queries over JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, but little work has been done on reducing how often parsing occurs. In this paper, we start with a study of a real production workload at Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations across queries. Instead of repeatedly parsing the JSON data, we propose Maxson, a cache system that stores JSON query results (the values evaluated from JSONPaths) for reuse. Specifically, we develop an effective machine learning-based predictor that combines LSTM (long short-term memory) and CRF (conditional random field) to determine which JSONPaths to cache under a given space budget. We have implemented Maxson on top of SparkSQL. Our experimental evaluation shows that 1) Maxson eliminates most of the duplicate JSON parsing overhead, and 2) Maxson improves end-to-end workload performance by 1.5-6.5×.
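The core idea of reusing values evaluated from a JSONPath, rather than re-parsing the raw JSON for every query, can be illustrated with a minimal sketch. This is not Maxson's implementation: the class `JsonPathCache`, the sample rows, and the simple dotted-path evaluator are all hypothetical, and the sketch caches every requested path instead of using a learned predictor or a space budget.

```python
import json


class JsonPathCache:
    """Toy cache of extracted values, keyed by (row_id, path). Hypothetical."""

    def __init__(self):
        self.values = {}      # (row_id, path) -> extracted value
        self.parse_count = 0  # number of full json.loads calls performed

    def _evaluate(self, raw, path):
        # Assumption: only simple dotted paths such as "$.user.id" are supported.
        self.parse_count += 1
        node = json.loads(raw)
        for key in path.lstrip("$").strip(".").split("."):
            node = node[key]
        return node

    def get(self, row_id, raw, path):
        # Parse only on a cache miss; later requests reuse the stored value.
        key = (row_id, path)
        if key not in self.values:
            self.values[key] = self._evaluate(raw, path)
        return self.values[key]


# Hypothetical sample data: two rows holding raw JSON strings.
rows = {
    1: '{"user": {"id": 7, "city": "Dallas"}}',
    2: '{"user": {"id": 9, "city": "Hangzhou"}}',
}

cache = JsonPathCache()
# Two "queries" that request the same JSONPath over the same rows.
first = [cache.get(rid, raw, "$.user.id") for rid, raw in rows.items()]
second = [cache.get(rid, raw, "$.user.id") for rid, raw in rows.items()]
```

In this sketch the second query triggers no additional parsing, so `cache.parse_count` stays at one parse per row. A real system would also need to decide which paths are worth caching and handle updates to the underlying data.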
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020 |
Publisher | IEEE |
Publication date | Apr 2020 |
Pages | 1621-1632 |
Article number | 9101499 |
ISBN (Electronic) | 9781728129037 |
Publication status | Published - Apr 2020 |
Event | 36th IEEE International Conference on Data Engineering, ICDE 2020 - Dallas, United States; Duration: 20 Apr 2020 → 24 Apr 2020 |
Conference
Conference | 36th IEEE International Conference on Data Engineering, ICDE 2020 |
---|---|
Country/Territory | United States |
City | Dallas |
Period | 20/04/2020 → 24/04/2020 |
Keywords
- Data analytics system
- JSON parsing
- Semi-structured format