Research Highlights

Advancing Toward Optimal Sampling: The Zenith of Data Analysis

2023-07-06 124

[POSTECH professor Wook-Shin Han’s team develops a new data sampling method for machine learning.]

The world witnessed a monumental face-off between human intelligence and artificial intelligence in March 2016. The computer program AlphaGo honed its skills from a substantial database and emerged victorious against a human opponent in Go, a game renowned for its complexity in calculating countless possible moves. The importance of quality data for AI’s continuous evolution is undeniable. AI has seamlessly integrated into such sectors as healthcare, finance, and education, while its advancements rely heavily on the availability of robust data for learning.

Data is typically stored in distributed groups known as tables. For an AI to glean insights from these table-stored data, a “join” process is deployed to amalgamate these disparate tables into one comprehensive table. The sheer scale of this resultant table presents challenges in terms of storage, while the join process itself can be quite time-consuming. Even now, developing techniques for swift and uniform data sampling from tables remains a complex puzzle yet to be solved in data science.

한욱신 교수팀_뷰페이지(en)In a significant breakthrough, a POSTECH research team led by Professor Wook-Shin Han (Graduate School of Artificial Intelligence) along with PhD candidate Kyoungmin Kim (Department of Convergence IT Engineering) proposed a novel method for optimal sampling of data stored across various tables. This new technique managed to generate results rapidly. The research was featured at the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2023). This marked a momentous occasion as it was the first instance of a paper from a Korean research team being presented at this prestigious symposium in its 42-year history.

The researchers pioneered a method named degree-based rejection sampling (DRS), which falls under the umbrella of meta-sampling.*1 Conventional methods necessitated pre-calculating probabilities for every value in the sample space*2 before any value could be extracted directly. By contrast, the DRS method proposed by the team initiates with the extraction of a sample space with a simple probability distribution based on the degree of specific values, subsequently drawing values from this sample space. The team convincingly demonstrated that at least one sample space affords a greater probability than the elaborate probabilities computed via traditional methodologies for any random value that can be selected. This implies that values can be obtained with similar probabilities as traditional methods via rejection sampling.*3 In this way, only the probability of extracting a sample space is merely multiplied as a constant value to the probability of sampling a value, avoiding complex probability calculations, and allowing for rapid data sampling.

Moreover, the team employed a technique known as generalized hypertree decompositions (GHDs) to extend the method, which involves analyzing a query*4 in a tree format during the join procedure of integrating tables. If an entire query is processed using a singular join algorithm, it can lead to high time complexity,*5 particularly when the query contains multiple join relations. Using GHDs allows for conducting join operations on smaller sub-queries instead of the entire query, and subsequently combining the results, thereby reducing time complexity. The research team integrated GHDs with DRS to augment the latter, guaranteeing a lower complexity than the original DRS in certain instances.

영문_보오온문 연구 관련 이미지
Heading the research, Professor Wook-Shin Han expressed high hopes for the innovative method, stating: “This technique can be universally applied to all queries, regardless of whether the data structures form a tree, exhibiting hierarchical relationships, or a cycle, depicting circular relationships. It promises to significantly improve both speed and accuracy in the data sampling process for machine learning.”

This study was conducted with the support from the Ministry of Science and ICT and the National Research Foundation of Korea.

1. Meta sampling
A method of extracting the sample space first

2. Sample space
The set of all possible results in a probabilistic experiment

3. Rejection sampling
One of the techniques used when it is difficult to sample according to the desired probability distribution

4. Query
A command used to search or manipulate data

5. Time complexity
A concept that indicates how the execution time of an algorithm increases with the size of the input; the lower the complexity, the more efficient the algorithm is.