
Building AI's Temporal Memory Part I: Data Retrieval Methods

Scoping methods for efficient key-value dataset retrieval.

High latency between the workspace kernel and pipelines is out of my expertise at this stage. What I can control is how, and what, data gets stored.

The tools at my disposal include Google BigQuery AI/ML. I chose it not because it is a NoSQL database, but because of how easily statistical methods slot into the aggregation layers of data pipelines, e.g. time-horizon methods that learn trends over the last 3 months. These are especially useful for building analytical dashboards, or for anyone too impatient to wait for a full 3 months of data to accumulate.
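
To make the time-horizon idea concrete, the sketch below shows roughly the kind of query I mean, issued through the BigQuery Python client. The project, dataset, table, and column names are all hypothetical.

from google.cloud import bigquery

# Hypothetical table and columns: roll the last 3 months of observations
# up into monthly trend aggregates.
client = bigquery.Client()
sql = """
    SELECT
        DATE_TRUNC(event_date, MONTH) AS month,
        AVG(metric_value)             AS avg_metric
    FROM `my_project.my_dataset.observations`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
    GROUP BY month
    ORDER BY month
"""
trends = client.query(sql).to_dataframe()  # pandas DataFrame of monthly trends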

If we branch this idea out into vectorized features, one can imagine that a Key-Value data structure for efficient retrieval looks something like:

import numpy as np

# Timestamp keys map directly to their stored feature vectors
{
    "27-01-2026:09:09": np.array([...])
}

This is suitable for cases where the vectors cannot be reverse engineered back to their original form, such as the latent space of deep learning models. Fortunately, that is not my problem; mine is understanding the patterns and differences between the datasets I have observed over time. This is especially difficult if you have handled datasets from the same domain: how do you distinguish or categorize them most efficiently, so you can decide how to handle the new dataset?

As a human, I am thankful that the weights in my cortex are updated automatically just by staying alive and enjoying small glimpses of life. In my bot's case, though, I am unsure how to calibrate his certainty in decision making against his previous observations.

After some trial and error, I settled on two core strategies:

  1. Simple Method: Key-Value storage where the keys are timestamps and the values store the vectors. Post-processing steps for optimizing the database would be quantizing the vectors further with the sentence-transformers Python library, or performing PCA to retain the subset of features most significant to the current decision-making strategy (see the sketch after this list).

  2. Easy and Lazy Method: Create my own way of hashing the pattern of column data types to categorize the features.
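
A minimal sketch of the first method, assuming scikit-learn's PCA for the optional feature reduction; the helper name build_vector_store and the component count are my own placeholders:

import numpy as np
from sklearn.decomposition import PCA

def build_vector_store(timestamps, vectors, n_components=8):
    """Map timestamp keys to PCA-reduced feature vectors (sketch only)."""
    reduced = PCA(n_components=n_components).fit_transform(np.asarray(vectors))
    return dict(zip(timestamps, reduced))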

I decided to go with the second method, mainly because there is only an off chance that I would ever need vector storage for deep learning models. The following dictionaries define the ruleset for assigning unique codes to each dataset my agent observes:

from enum import Enum

# Session Base Dataset
class GenerateCodeDtypeConfig(Enum):
    """ Configuration for generating code based on data types. """
    pandasDtype = {
        "bool":      "OBO",  # boolean
        "datetime":  "ODT",  # datetime64[ns], datetime64[ns, tz]
        "timedelta": "ODT",  # treat as time-like; keep same family if desired
        "numeric":   "NCO",  # all numeric (int/float) without value-based splitting
        "category":  "CBI",  # pandas 'category' dtype (categorical)
        "string":    "OTX",  # pandas StringDtype
        "object":    "OTX",  # object (often text/categorical; we treat as text to avoid heuristics)
        "None":     "OUK"   # unknown / other dtypes
    }
    categoryCode = {
        "numerical": "N",
        "categorical": "C",
        "others": "O"
    }
    subCode = {
        "continuous": "NC",
        "discrete": "ND",
        "nominal": "NN",
        "ordinal": "NO",
        "binary-label": "CB",
        "multi-label": "CM",
        "datetime": "OD",
        "geospatial": "OG",
        "text": "OT",
        "boolean": "OB",
        "identifier": "OI",
    }

The configuration defines the ruleset for computing unique, pattern-based codes for each dataset. Every dataset received by my agent gets 3 codes. Each data-type code is always suffixed by an integer indicating the total number of columns holding that data type. For example, if a dataset has 3 numerical columns, 1 categorical column, and 2 others, the code would be N3C1O2.
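
As a rough illustration, the sketch below computes just the parent-category code from a pandas DataFrame. It is standalone rather than wired into the Enum above, and the helper name plus the fixed N/C/O ordering are my own assumptions:

import pandas as pd

def generate_category_code(df: pd.DataFrame) -> str:
    """Count columns per parent category and concatenate '<code><count>' pairs."""
    counts = {"N": 0, "C": 0, "O": 0}
    for dtype in df.dtypes:
        if isinstance(dtype, pd.CategoricalDtype):
            counts["C"] += 1
        elif pd.api.types.is_bool_dtype(dtype):
            counts["O"] += 1  # booleans fall under "others"
        elif pd.api.types.is_numeric_dtype(dtype):
            counts["N"] += 1
        else:
            counts["O"] += 1  # datetimes, text, identifiers, ...
    return "".join(f"{code}{count}" for code, count in counts.items())

A DataFrame with 3 numeric columns, 1 categorical column, and 2 object columns would then yield "N3C1O2", matching the example above.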

In summary:

  • Post-processed features: the categoryCode and subCode maps cover the parent and child data types, respectively.

  • Pre-processed features: the pandasDtype map covers the original data fields.

Forgetting about vector database retrieval and sticking with fetching and appending .txt files may sound like a boring route. But in most cases in life, boring is good, and simplicity always buys us more time.
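
In that spirit, here is a minimal sketch of the append side; the file name, separator, and helper name are purely illustrative:

from datetime import datetime, timezone

def append_observation(code: str, log_path: str = "observations.txt") -> None:
    """Append a timestamped dataset code to a plain .txt log (illustrative)."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="minutes")
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(f"{stamp}\t{code}\n")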

Example codes generated. The fieldPattern value displays the combined pattern together with the child data-type categories.

For now, I guess that is sufficient as his temporal lobe. Given the time spent on this project, I am confident that the system architecture will support any new features I introduce in the future.
