# Kaggle Competition: Make Data Count

## Competition Objectives

The competition has two objectives:

1. Identify data citations in scientific literature (from full text, e.g. PDF / XML). In other words: find where a paper mentions a dataset or data resource.
2. Classify each citation into one of two categories:
   * Primary: the data was generated or produced as part of the paper's own study.
   * Secondary: the data was reused, derived, or taken from elsewhere (existing datasets).
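To make the first objective concrete, here is a minimal sketch of spotting DOI-style identifiers in a paper's plain text. It is an illustration only, not necessarily how the notebook does it: the regular expression, the `find_doi_mentions` helper, and the sample text (with made-up DOIs) are all assumptions.

```python
import re

# Loose DOI pattern (an assumption for illustration): "10.", a 4-9 digit
# registrant code, a slash, then a suffix of common DOI characters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi_mentions(text):
    """Return (character offset, DOI string) pairs for DOI-like spans in text."""
    mentions = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing sentence punctuation is usually not part of the DOI.
        # Caveat: this would also trim a DOI that really ends in ")".
        doi = match.group(0).rstrip(".,;:)")
        mentions.append((match.start(), doi))
    return mentions


# Hypothetical sample text; both DOIs are invented.
sample = (
    "Data are available at https://doi.org/10.5061/dryad.abc123. "
    "We also reused the set published under 10.1234/example-set."
)
for offset, doi in find_doi_mentions(sample):
    print(offset, doi)
```

Finding the identifier is only the first half of the task. Deciding whether a mention is Primary or Secondary still needs the surrounding context, which a pattern match alone does not provide.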

## Kaggle Workspace Repo

{% embed url="<https://github.com/whoamimi/kaggle-kernel/blob/main/make-data-count-workbook-submission.ipynb>" %}

## My Grief

My plan was to vectorize the research papers and train a self-organizing map (SOM) on the resulting embeddings, using cosine similarity as the similarity measure. Because the dataset references appear at arbitrary positions throughout a paper, I thought it would be easier to split each paper into `K` equal segments (for some integer `K`) and detect primary / secondary links per segment. In other words, the training data for one research paper would have the shape `[1, K, 384]`:

* 1 = batch size
* K = number of equal segments the paper is split into
* 384 = default hidden dimension size of the `sentence-transformers` encoder model. In this case, `all-MiniLM-L6-v2` was used; its hidden size is 384, i.e. the dimension of the vector produced when a sentence is encoded (its representation in latent space)
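The shape described above can be sketched as follows. This is a rough illustration under stated assumptions, not the notebook's actual code: `split_into_segments`, `toy_encode`, and `paper_to_batch` are hypothetical helpers, segments are cut by word count, and `toy_encode` is a hashed bag-of-words stand-in, so the example runs without downloading a model.

```python
import hashlib
import math

EMBED_DIM = 384  # hidden size of all-MiniLM-L6-v2, as stated above


def split_into_segments(text, k):
    """Split text into k word-level segments whose sizes differ by at most one."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    words = text.split()
    base, extra = divmod(len(words), k)
    segments, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` segments get one more word
        segments.append(" ".join(words[start:start + size]))
        start += size
    return segments


def toy_encode(segment):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * EMBED_DIM
    for word in segment.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % EMBED_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def paper_to_batch(text, k, encode=toy_encode):
    """Return a nested list of shape [1, K, 384] for a single paper."""
    return [[list(encode(segment)) for segment in split_into_segments(text, k)]]


paper = "We collected new samples and deposited them. " * 10
batch = paper_to_batch(paper, k=4)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 1 4 384
```

To get real sentence embeddings, the `encode` argument could be swapped for `SentenceTransformer("all-MiniLM-L6-v2").encode` from `sentence-transformers`, which returns a 384-dimensional vector for a single string.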

Unfortunately, I ran into server / Kaggle workspace configuration errors, so I wasn't able to submit the complete model, but the data extraction (objective 1) was completed.

Although the competition is over, I am still eager to keep trying. Hopefully I will be able to complete this model when I find the time.

