LitCoin NLP Challenge: Part 1

Identify mentions of biomedical entities in research abstracts

NCATS & NASA

Brief

This 2-phase competition is part of the NASA Tournament Lab and is hosted by NCBI (The National Center for Biotechnology Information), NCATS (The National Center for Advancing Translational Sciences) and NIH (National Institutes of Health). These institutions, in collaboration with bitgrit and CrowdPlat, have come together to bring you this challenge, where you can deploy your data-driven technology solutions to accelerate scientific research in medicine and ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers.

Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine whether each assertion is a novel finding or background information.

Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).

Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (position in text and BioLink Model Predicate).

*NOTE: The prizes listed will be awarded based on a competitor’s combined, weighted scores from both phases of the competition. Please see the Rules section for more information.*

The National Center for Advancing Translational Sciences (NCATS, a center of the National Institutes of Health): NCATS is conducting this challenge under the America Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science (COMPETES) Reauthorization Act of 2010. This challenge will spur innovation in NLP to advance the field and allow the generation of more accurate and useful data from biomedical publications, which will enhance the ability of data scientists to create tools that foster discovery and generate new hypotheses.

The National Center for Biotechnology Information (NCBI, part of the National Library of Medicine, a division of the National Institutes of Health): NCBI intramural researchers and their collaborators have provided a corpus of annotated abstracts from published scientific research articles, along with knowledge assertions between the annotated concepts, which will be provided to participants for training and testing purposes.

CrowdPlat (Project Company): The LitCoin project was awarded to and is being managed by CrowdPlat under NASA's NOIS2 contract. Located in San Jose, California, CrowdPlat provides crowdsourcing solutions to medium to large scale enterprises seeking project execution through a crowdsourced talent pool.

Prizes

1st Prize ($35,000)
2nd Prize ($25,000)
3rd Prize ($20,000)
4th Prize ($5,000)
5th Prize ($5,000)
6th Prize ($5,000)
7th Prize ($5,000)

The prize money displayed is the total prize for both phases of the LitCoin NLP Challenge. Please see the Rules section for more info!

Timeline

09 Nov 2021 Competition Phase-1 Starts
23 Dec 2021 Competition Phase-1 Ends
28 Dec 2021 Competition Phase-2 Starts

Data Breakdown

The goal of the first part of the LitCoin Challenge is to identify the location and type of biomedical concepts (entities) within a research paper’s title and abstract. The location of each biomedical entity is given by two ‘offset’ numbers, ‘offset_start’ and ‘offset_finish’, which indicate the index or position where the concept substring starts and the index or position where it ends, respectively. The string considered for these indexes is the concatenation of the title + abstract strings. The number of entities in each abstract is not fixed.

The type of each biomedical entity comes from the BioLink Model Categories, and can be any of the following:
・DiseaseOrPhenotypicFeature
・ChemicalEntity
・OrganismTaxon
・GeneOrGeneProduct
・SequenceVariant
・CellLine

To properly understand these concepts, it helps to be familiar with biomedical ontologies: [LINK]

The following files are included in the dataset:
・abstracts_train.csv:
# abstract_id: ID of the research paper
# title: title of the research paper
# abstract: abstract of the research paper
・entities_train.csv: CSV file containing all the entities (each as a pair of offsets and a type) found in the abstracts that can be used for training.
# id:
# abstract_id:
# offset_start:
# offset_finish:
# type:
# entity_id:
・relations_train.csv: CSV file containing all the relations found in the abstracts that can be used for training FOR THE SECOND PART OF THE COMPETITION.
# id:
# entity_id_1:
# entity_id_2:
# relation_type:
# novel:
・abstracts_test.csv:
# abstract_id: ID of the research paper
# title: title of the research paper
# abstract: abstract of the research paper

The evaluation metric for this problem is a modified version of the Jaccard Similarity Score:
・For an abstract_id A, a set of predicted concepts P and a set of actual original concepts O, the score is |P⋂O| / |P⋃O|, with |P⋃O| = |P| + |O| - |P⋂O|, where ⋂ means intersection, ⋃ means union, and || means the length or number of concepts. A predicted concept and an original concept match (and so count toward the intersection) when they have the same (or very similar) offsets and the same type.

The Jaccard Similarity Scores for each abstract are then averaged to return the final score.

Final competition results are based on each competitor’s combined, weighted scores from both phases of the competition. Winners will be determined by a weighted average of scores from the two competition phases: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.

*NOTE: WHEN YOU UPLOAD YOUR SUBMISSION, AN ERROR SAYING "INVALID FILE" MIGHT APPEAR. PLEASE IGNORE IT AND CHECK WHETHER YOUR SUBMISSION WAS SCORED.*
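The offset convention and the scoring described in this section can be sketched in Python as follows. This is only an illustrative sketch, not the organizers' scoring code. It rests on several assumptions that the page does not confirm: that the title and abstract are joined with a single space, that offset_finish is exclusive (as in a Python slice), that concepts match only on exactly equal offsets and type (the page also allows "very similar" offsets without defining them), and that the average is taken over the abstracts present in the ground truth. The function names and the sample title and abstract are made up for the example.

```python
from statistics import mean

def build_text(title, abstract, sep=" "):
    # ASSUMPTION: a single space joins title and abstract; the page only
    # says the string is "the concatenation of the title + abstract".
    return title + sep + abstract

def entity_text(text, offset_start, offset_finish):
    # ASSUMPTION: offset_finish is exclusive, like a Python slice end.
    return text[offset_start:offset_finish]

def jaccard(predicted, actual):
    # Each concept is a tuple (offset_start, offset_finish, type).
    # ASSUMPTION: exact equality stands in for "same (or very similar)".
    p, o = set(predicted), set(actual)
    union = p | o
    if not union:
        return 1.0  # ASSUMPTION: nothing predicted, nothing expected
    return len(p & o) / len(union)

def mean_jaccard(pred_by_abstract, actual_by_abstract):
    # ASSUMPTION: average over the abstract_ids in the ground truth.
    return mean(
        jaccard(pred_by_abstract.get(abstract_id, ()), actual)
        for abstract_id, actual in actual_by_abstract.items()
    )

def combined_score(part1_score, part2_score):
    # 30% / 70% weighting of the two phases, as stated in the rules.
    return 0.3 * part1_score + 0.7 * part2_score

# Made-up example abstract, for illustration only.
text = build_text("BRCA1 mutations in breast cancer", "We studied patients.")
actual = {(0, 5, "GeneOrGeneProduct"), (19, 32, "DiseaseOrPhenotypicFeature")}
predicted = {(0, 5, "GeneOrGeneProduct"), (19, 32, "ChemicalEntity")}

print(entity_text(text, 19, 32))       # the span behind the second entity
print(jaccard(predicted, actual))      # one match out of three distinct concepts
print(mean_jaccard({1: predicted}, {1: actual}))
```

In this example the second prediction has the right offsets but the wrong type, so it counts once in the union as a miss and once as a spurious concept, which is why a single type error costs more than one missing entity would.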

FAQs

Who do I contact if I need help regarding a competition?

If you have any inquiries about participating in this competition, please don’t hesitate to reach out to us at [email protected]. For questions about eligibility or prize distribution, email NCATS at [email protected].

How will I know if I’ve won?

If you are one of the top seven winners for this competition, we will email you with the final result and information about how to claim your reward.

How can I report a bug?

Please shoot us an email at [email protected] with details and a description of the bug you are facing, and if possible, please attach a screenshot of the bug itself.

If I win, how can I receive my reward?

The money prize will be awarded by NIH/NCATS directly to the winner (if an individual) or Team Lead of the winning team (if a team). Please check rule number 7 for eligibility information. Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to Federal income taxes. HHS/NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.

Rules

1. This competition is governed by the following Terms of Participation (“Participation Rules”). Participants must agree to and comply with the Participation Rules to compete.
2. This competition consists of 2 problem statements, herein considered as competition sub-phases. Winners will be determined by a weighted average of scores from the two competition phases: 30% of the total score will be determined by problem statement 1 and 70% of the total score will be determined by problem statement 2.
3. The competition dates are detailed below:
Phase 1 Start Date: November 9th, 2021
Phase 1 Closing Date: December 23rd, 2021
Phase 2 Start Date: December 28th, 2021
Phase 2 Closing Date: February 28th, 2022
Submission (Final Source Code): March 11th, 2022
Winners Announced: April 8th, 2022
4. Participants are allowed to participate in an individual capacity or as part of a team.
5. Merging teams midway through the competition is not allowed.
6. Each participant may only be a member of a single team and may not participate as an individual and on a team simultaneously.
7. In order to participate in this competition and be eligible for the prize money, participants must be U.S. citizens or U.S. permanent residents. Non-U.S. citizens and non-permanent residents can participate as well, either as members of a team that includes a citizen or permanent resident of the U.S., or on their own. However, such non-U.S. citizens and non-permanent residents are not eligible to win a monetary prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Similarly, if participating on their own, they may be eligible to win a non-cash recognition prize. Proof of citizenship or permanent residency will be required. For more information on competition eligibility requirements, please see https://ncats.nih.gov/challenges/litcoin
8. In the case of team participation, all submissions must be made by the team lead.
9. The use of external datasets for the purposes of training is allowed, but submissions must be generated using the test corpus provided.
10. During the competition period, participants will be allowed a maximum of 5 submissions per day. Once the daily limit is reached, the platform resets the following day to allow 5 more submissions. Please keep this in mind when uploading a submission file. Any attempt to circumvent stated limits will result in disqualification.
11. Participants are not permitted to share or upload the competition dataset to any platform outside of the competition. Participants that do not comply with the confidentiality regulations of the competition will be disqualified.
12. The top seven (7) winning participants will be eligible to receive a competition prize (ranked by performance) after we have received, successfully executed, and confirmed the validity of both the code and the solution (See 14.). In order to ensure that at least 7 participants may be awarded prizes, the top fifteen (15) individuals/teams will be asked to submit their source code for evaluation (see 13.).
13. Once potential competition winners are determined and our team reaches out to them, the top scoring participants must provide the following by March 11, 2022 for evaluation in order to qualify as competition winner(s) and receive their prize:
a. Winning Model Documentation template filled in (this document is available on the “Resources” tab on the competition page)
b. All source files required to preprocess the data
c. All source files required to build, train and make predictions with the model using the processed data
d. A requirements.txt (or equivalent) file indicating all the required libraries and their versions as needed
e. A ReadMe file containing the following:
• Clear and unambiguous instructions on how to reproduce the predictions from start to finish including data pre-processing, feature extraction, model training and predictions generation
• Environment details regarding where the model was developed and trained, including OS, memory (RAM), disk space, CPU/GPU used, and any environment configurations required to execute the code
• Clear answers to the following questions:
- Which data files are being used?
- How are these files processed?
- What is the algorithm used and what are its main hyperparameters?
- Any other comments considered relevant to understanding and using the model
14. Solution submissions should be able to generate the exact output that gives the corresponding score on the leaderboard. If the score obtained from the code is different from what’s shown on the leaderboard, the new score (which may be lower) will be used for the final rankings unless a logical explanation is provided. Please make sure to set the seed or random_state etc. so we can obtain the same result from your code.
15. Solution submissions will also be used to generate output based on a validation dataset, generated in the same manner as the provided test and training sets, which will be kept hidden from all participants, in order to verify that code was not customized for the provided dataset. This output will not be used to determine leaderboard position, but could be used to disqualify a participant from receiving a prize if the output is judged to be severely inaccurate by bitgrit, CrowdPlat and NCATS.
16. In order to be eligible for the prize, a competition winner (whether an individual, group of individuals, or entity) must agree to grant to the NIH an irrevocable, paid-up, royalty-free, non-exclusive worldwide license to reproduce, publish, post, link to, share, and display publicly the submission on the web or elsewhere, and a non-exclusive, non-transferable, irrevocable, paid-up license to practice or have practiced for or on its behalf, the solution throughout the world. For more detailed information, please visit http://ncats.nih.gov/challenges/litcoin.
17. Any prize awards are subject to verification of eligibility and compliance with these Participation Rules. Novelty and innovation of submissions may also affect the final ranking. All decisions of bitgrit, CrowdPlat and NCATS will be final and binding on all matters relating to this Competition.
18. Cash prizes will be paid directly by NIH/NCATS to the competition winners. In the case of a winning team, the cash prize will be paid directly by NIH/NCATS to the Team Lead. Non-U.S. citizens and non-permanent residents are not eligible to receive a cash prize (in whole or in part). Their participation as part of a winning team, if applicable, may be recognized when the results are announced. Prizes awarded under this Challenge will be paid by electronic funds transfer and may be subject to local, state, federal and foreign tax reporting and withholding requirements. HHS/NIH will comply with the Internal Revenue Service withholding and reporting requirements, where applicable.
19. If two or more participants have the same score on the leaderboard, an earlier submission will take precedence and be ranked higher than a later submission.
20. If you have any inquiries about participating in this competition, please don’t hesitate to reach out to us at [email protected]. For questions about eligibility or prize distribution, email NCATS at [email protected].
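Rule 14 asks participants to set the seed or random_state so that the organizers can reproduce the leaderboard output. A minimal sketch of one way to do that in Python is shown below. The helper name set_seed and the default value 42 are made up for illustration, and the sketch only covers Python's built-in random module and, if it happens to be installed, NumPy's legacy global generator. Any other library used by a solution (for example a deep learning framework or a model's own random_state argument) would need its own seeding, which is not shown here.

```python
import random

def set_seed(seed=42):
    """Seed the random number generators this sketch knows about.

    `set_seed` is a hypothetical helper, not part of the competition kit.
    """
    # Python's built-in generator.
    random.seed(seed)
    # NumPy's legacy global generator, only if NumPy is available.
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)

# Calling the helper with the same seed yields the same draws.
set_seed(7)
first_run = [random.random() for _ in range(3)]
set_seed(7)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)
```

Calling such a helper once at the top of the training and prediction scripts, and recording the seed value in the ReadMe requested in rule 13, keeps the two consistent.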
