Peercode
MyBit Group

Analysis: Claude vs DeepSeek, who is better at gathering flight data?

Research question: Given an e-mail with flight information, to what extent can we use Claude 3.5 Sonnet's and DeepSeek-V3's generative AI to extract the details into a structured format for subsequent use?

DeepSeek-V3 vs Claude 3.5 Sonnet

Approach: We gathered 14 e-mails that contained flight information and 2 e-mails that did not. We then parsed the e-mails to strip things like HTML tags and converted them into text files. The script also removed some larger chunks of data (like luggage policies) so that everything would fit within Claude's and DeepSeek's maximum context limit. In both cases the chat versions of the models were used, as DeepSeek-R1 had been down for over a week at the time of writing. We fed the text into each AI model, preceded by the following prompt:
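The preprocessing step can be sketched roughly as follows; the helper names, the boilerplate marker, and the character limit are illustrative assumptions, not the actual script:

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def email_to_text(html: str, max_chars: int = 30_000) -> str:
    """Strip HTML tags, drop large boilerplate sections (e.g. luggage
    policies), and truncate so the result fits the model's context limit.
    The section markers and the limit are assumed values."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
    # Remove a large boilerplate block between assumed marker phrases.
    text = re.sub(r"Baggage policy.*?End of policy", "", text, flags=re.S)
    return text[:max_chars]
```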

Given the following e-mail content, extract in JSON the following details for each flight in the mail: flight number, departure datetime, departure airport IATA code, arrival datetime, arrival airport IATA code, confirmation number, price.
If a value is not present in the e-mail, indicate this with a null value. Take into account that the flight number may or may not contain letters.
For each flight extract a list of passengers including the name of each passenger. Please take good care when looking at the price, if its not clear which amount is the price because there are multiple prices listed, go for the highest price. only do this if its not clear what the price is however. There might not be an exact flight_name value but then just gather the start and end location listed in the email and format them in a “x to y” format, dont make up locations however.
Provide a JSON list like the following:
[
    {
        "flight_name": "Lilly Lake to Metlakatla",
        "flight_number": "FLGT0001",
        "departure_datetime": "2024-03-30 18:10:00",
        "departure_iata": "FQC",
        "arrival_datetime": "2024-03-30 23:24:00",
        "arrival_iata": "MTM",
        "confirmation_number": "9700012455",
        "price": "$261",
        "passengers": [
            "John Doe",
            "Karin Doe"
        ]
    }
]

The e-mail content to process starts below the next line.
—  

This is a slightly different prompt than the one used in previous tests. Those tests showed that DeepSeek would accurately extract all flight data except for the price, which is why "Please take good care when looking at the price, if it's not clear which amount is the price because there are multiple prices listed, go for the highest price. only do this if its not clear what the price is however." was added to the prompt. For some reason this resulted in DeepSeek no longer returning the name of the flight, which is why "There might not be an exact flight_name value but then just gather the start and end location listed in the email and format them in a "x to y" format, dont make up locations however." was added to the prompt as well.
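As a rough sketch, the request can be assembled as a single-turn chat message (a shape both chat APIs accept; DeepSeek's API is OpenAI-compatible), and the JSON list can be pulled out of the reply even when a model wraps it in markdown fences or prose. The helper names and the fence-tolerant regex are assumptions, not the actual script:

```python
import json
import re

# Illustrative stand-in for the full extraction prompt shown above.
EXTRACTION_PROMPT = (
    "Given the following e-mail content, extract in JSON ... "
    "The e-mail content to process starts below the next line.\n---"
)


def build_messages(email_text: str) -> list:
    """Single-turn chat payload: the fixed instruction prompt followed
    by one preprocessed e-mail."""
    return [{"role": "user", "content": f"{EXTRACTION_PROMPT}\n{email_text}"}]


def parse_flights(response_text: str):
    """Extract the first JSON list from the model's reply, tolerating
    surrounding prose or markdown code fences; returns None on failure."""
    match = re.search(r"\[.*\]", response_text, flags=re.S)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```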

After also manually extracting and listing all relevant fields from the e-mails, we set up an automated script that compares the AI results against the true (manual) reference. The comparison checks for the following:

  • The returned data is a list of flight details
  • The returned data lists exactly the number of flights that are present in the e-mail
  • Each item (flight) in the data list is formatted as expected
  • For each flight:
    • The flight number matches (allowing for space differences)
    • The departure date+time is correct and formatted as yyyy-mm-dd hh:mm:ss
    • The departure IATA code is correct
    • The arrival date+time is correct and formatted as yyyy-mm-dd hh:mm:ss
    • The arrival IATA code is correct
    • The confirmation number is correct
    • The expected price (numerical value, in dollars) is contained in the detected price (which may also contain e.g. airmiles)
    • The number of passengers is correct
    • The listed passenger names match the expected values (up to a 50% Levenshtein threshold)
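The per-flight checks above can be sketched as small helpers; the function names, and lowercasing before the Levenshtein comparison, are assumptions and the actual script may differ:

```python
from datetime import datetime


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def flight_number_matches(detected: str, expected: str) -> bool:
    """Compare flight numbers while allowing for space differences."""
    return detected.replace(" ", "") == expected.replace(" ", "")


def datetime_is_valid(value: str) -> bool:
    """Check the yyyy-mm-dd hh:mm:ss format required by the prompt."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except (TypeError, ValueError):
        return False


def price_matches(detected: str, expected_dollars: str) -> bool:
    """The expected dollar amount must appear somewhere in the detected
    price string, which may also mention e.g. airmiles."""
    return expected_dollars in (detected or "")


def name_matches(detected: str, expected: str) -> bool:
    """Accept a passenger name when the edit distance stays within 50%
    of the expected name's length (case-insensitive, an assumption)."""
    return levenshtein(detected.lower(), expected.lower()) <= len(expected) * 0.5
```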

We keep track of:

  • Number of successful checks
  • Number of failed checks
  • Types of failed checks

Results

Both Claude 3.5 Sonnet and DeepSeek-V3 show very promising results.

Metric 1: No results

The following table shows the number of e-mails in which the AI models could not find any flight details. (Lower is better)

             Claude 3.5 Sonnet   DeepSeek-V3
No results           0                0

When an e-mail contained information about a hotel booking but not about a flight, both models returned a JSON string containing the price, date, and 'passengers' of the booking. The data was still factual and thus not considered wrong.

Metric 2: Bad results

              Claude 3.5 Sonnet   DeepSeek-V3
Bad results           8                0

The last time this test was run, DeepSeek had a high number of bad results because it couldn't find the correct price; after changing the prompt to put extra emphasis on finding the correct price, this number dropped to 0. Claude, however, made a couple of different errors. In 4 cases the model returned the name of the passenger in an incorrect format, "JOHN/DOE" rather than "John Doe". In 3 cases the model either couldn't get the correct price (it returned one of the prices listed, just not the right one) or returned null where DeepSeek did find the correct value. In the last case Claude simply missed some values that were clearly listed in the e-mail.

Metric 3: No errors

The following table shows the total number of flights that were returned fully correctly. (Higher is better)

                  Claude 3.5 Sonnet   DeepSeek-V3
Correct flights          13               21

Metric 4: Total number of successfully detected details

The following table shows the total number of correct details detected, summed over all e-mails. (Higher is better)

                  Claude 3.5 Sonnet   DeepSeek-V3
Correct details         179              189

Compared to the total number of 189 checks/details, the success percentages are as follows:

               Claude 3.5 Sonnet   DeepSeek-V3
Success rate        94.7%              100%
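The success rates follow directly from the detail counts, e.g.:

```python
total_checks = 189

# Correct details per model, divided by the total number of checks.
claude_rate = round(100 * 179 / total_checks, 1)    # 94.7
deepseek_rate = round(100 * 189 / total_checks, 1)  # 100.0
```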

Considerations

Compared to the last test, Claude did a slightly worse job (~3%). The difference in prompt (most likely its length) could be the reason. It should be noted, however, that some of Claude's errors did not correlate with the prompt changes (returning null for values that can be found within the e-mail), and it could very well be that Claude is simply less consistent than DeepSeek.

It should also be noted that DeepSeek-V3 is considerably cheaper than Claude, at $1.10 per 1 million output tokens compared to Claude's $15 per 1 million output tokens (Models & Pricing | DeepSeek API Docs, n.d.; Pricing, n.d.).

Conclusion

Based on the provided flight e-mails, Claude 3.5 Sonnet did worse than DeepSeek-V3, with a difference of 5.3% (94.7% vs 100%). The errors in the data returned by Claude were inconsistent, and although the error margin could probably be reduced by changing the prompt in Claude's favor, it is unlikely that Claude would return the correct data 100% of the time. DeepSeek-V3 is the more consistent and accurate model.

But what about privacy & security?
While DeepSeek-V3 delivers impressive results, it is worth noting that DeepSeek is a Chinese-owned company, raising questions about data privacy, compliance, and security. For organizations handling sensitive or regulated data, this could be a critical consideration.

AI models evolve constantly
These results are a snapshot in time, not the final word. Model updates, API changes, and new competitors could shift the landscape completely in just a few months. What works best today might not be the winner tomorrow!

Next steps

Eager to discuss how generative AI can improve your business processes? Don't hesitate to call us!

Sources

Models & Pricing | DeepSeek API Docs. (n.d.). https://api-docs.deepseek.com/quick_start/pricing

Pricing. (n.d.). https://www.anthropic.com/pricing#anthropic-api
