<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Neva Erdogan on Medium]]></title>
        <description><![CDATA[Stories by Neva Erdogan on Medium]]></description>
        <link>https://medium.com/@nevardogan?source=rss-b1544ba69283------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*M3ZV57dz_P7XqIHhajt25Q.jpeg</url>
            <title>Stories by Neva Erdogan on Medium</title>
            <link>https://medium.com/@nevardogan?source=rss-b1544ba69283------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 23 May 2026 15:07:45 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@nevardogan/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Why Is ChatGPT So Polite If It Learned From Reddit?]]></title>
            <link>https://medium.com/@nevardogan/why-is-chatgpt-so-polite-if-it-learned-from-reddit-30e2ce07d0c6?source=rss-b1544ba69283------2</link>
            <guid isPermaLink="false">https://medium.com/p/30e2ce07d0c6</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[rlhf]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Neva Erdogan]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 20:28:46 GMT</pubDate>
            <atom:updated>2026-02-11T20:28:46.590Z</atom:updated>
            <content:encoded><![CDATA[<p>If large language models (LLMs) are trained on massive portions of the internet (Reddit threads, comment sections, forums, blogs, documentation, and books), then why don’t they behave like the internet?</p><p>Why aren’t they sarcastic, chaotic, impulsive, or outright toxic?</p><p>The short answer is this:</p><p><strong>The internet made LLMs smart. Alignment made them usable.</strong></p><p>But the real story is much more technical and much more interesting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RUD6QSarQWRfzWAw_aAQKg.png" /></figure><h3>The Two-Stage Myth and the Real Pipeline</h3><p>People often simplify LLM training into two steps:</p><ol><li>Train on the internet</li><li>Apply RLHF</li></ol><p>This is directionally correct, but technically incomplete.</p><p>The real pipeline looks closer to this:</p><p><strong>Pretraining → Supervised Fine-Tuning → Reward Modeling → Reinforcement Learning → Evaluation &amp; Iteration</strong></p><p>Each stage transforms a raw text predictor into something that behaves like an assistant.</p><h3>Stage 1 — Pretraining: Teaching a Neural Network Language</h3><p>At their core, large language models are probability machines.</p><p>They don’t “know” facts.<br> They don’t “understand” meaning.</p><p>They learn statistical patterns.</p><p>LLMs are trained on enormous text datasets and learn to <strong>predict the next token in a sequence</strong>.</p><p>In other words:</p><blockquote><em>They are not answering questions; they are continuing text.</em></blockquote><p>This distinction is critical.</p><p>From a machine learning perspective, pretraining typically involves:</p><ul><li>Self-supervised learning</li><li>Transformer-based architecture</li><li>Gradient descent optimization</li><li>Massive distributed training</li></ul><p>The model approximates the probability distribution of language itself.</p><h3>The Problem With Raw Pretrained 
Models</h3><p>A pretrained model can:</p><ul><li>generate grammatical text</li><li>imitate styles</li><li>write essays</li><li>produce code</li></ul><p>But it is <strong>not inherently helpful.</strong></p><p>Without alignment, it may:</p><ul><li>ignore user intent</li><li>produce irrelevant answers</li><li>mirror biases in training data</li><li>generate unsafe content</li></ul><p>Pretrained models can output text that is “technically correct but irrelevant, incoherent, or even unsafe.”</p><p>Why?</p><p>Because language modeling optimizes for <strong>likelihood</strong>, not <strong>quality</strong>.</p><p>Truthfulness, usefulness, and safety are not easily encoded as mathematical loss functions.</p><p>So researchers needed a way to teach models something abstract:</p><blockquote><em>What does a </em>good<em> answer look like?</em></blockquote><p>Enter alignment.</p><h3>Stage 2 — Supervised Fine-Tuning (The Hidden Middle Layer)</h3><p>Before RLHF even begins, most models undergo supervised instruction tuning. Humans write example prompt–response pairs.<br> The model learns to imitate them.</p><p>This step:</p><ul><li>teaches formatting</li><li>improves instruction following</li><li>shapes conversational behavior</li></ul><p>Both the reward model and the RL policy are usually initialized from this fine-tuned model rather than from the raw pretrained base.</p><p>Think of it as teaching the AI basic manners before judging its performance.</p><h3>Stage 3 — Reward Modeling: Turning Human Taste Into Math</h3><p>Now comes the clever trick.</p><p>Instead of hand-writing rules for “good behavior,” researchers train a <strong>reward model</strong>.</p><p>Here’s how it works:</p><p>Humans compare several outputs for the same prompt and rank them. 
The reward model learns to predict:</p><blockquote><em>Which response would a human prefer?</em></blockquote><p><strong>RLHF</strong> leverages these preference signals to align models with human expectations.</p><p>This converts subjective human judgment into a differentiable optimization target. 
A massive conceptual breakthrough.</p><p>Because now:</p><p><strong>Politeness becomes a gradient.</strong></p><h3>Stage 4 — Reinforcement Learning (The Alignment Engine)</h3><p>Reinforcement learning trains the model to maximize reward.</p><p>Three models interact:</p><ul><li><strong>Policy model</strong> → generates responses</li><li><strong>Reward model</strong> → scores them</li><li><strong>Value model</strong> → estimates expected future reward, which reduces the variance of policy updates</li></ul><p>The policy updates itself to produce higher-scoring outputs.</p><p>Most modern RLHF pipelines use algorithms like <strong>Proximal Policy Optimization (PPO)</strong> to stabilize training and prevent the model from changing too drastically.</p><p>Why stability matters:</p><p>If optimization is unconstrained, models may “game” the reward function and produce nonsense.</p><p>PPO clips each update so the new policy stays close to the previous one.</p><p>This is classic reinforcement learning, but applied to language.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*WSqDSIT04mo4vzPThxByhw.png" /><figcaption>RLHF Pipeline Diagram</figcaption></figure><h3>Why RLHF Exists: Scale Breaks Rule-Based Safety</h3><p>Modern LLMs have billions of parameters. 
You cannot hard-code behavior at that scale.</p><p>Rule-based filtering cannot capture the nuance of human preference.</p><p>RLHF introduces a human-guided reward signal that makes outputs more consistent with real-world expectations.</p><p>It essentially transforms a static text generator into an interactive system.</p><h3>Does RLHF Actually Work?</h3><p>Empirically, yes.</p><p>Examples:</p><ul><li>RLHF improved instruction-following by <strong>17%</strong></li><li>Reduced toxic outputs by <strong>48%</strong> in one ChatGPT lineage comparison</li></ul><p>An aligned model can even outperform a much larger unaligned one:</p><p>Human evaluators preferred outputs from a <strong>1.3B parameter RLHF model</strong> over outputs from a <strong>175B parameter raw model</strong>.</p><p>Alignment can matter more than scale. That is a huge shift in AI research.</p><h3>Mixing Objectives — Avoiding Catastrophic Forgetting</h3><p>There’s another subtle technical detail many people miss.</p><p>During RLHF, models still train partly on original language data to prevent “forgetting” general knowledge.</p><p>Otherwise a customer-service tuned model might literally forget geography.</p><p>This hybrid objective balances:</p><ul><li>human alignment</li><li>linguistic competence</li></ul><p>An elegant engineering compromise.</p><h3>The Core Insight: Intelligence ≠ Alignment</h3><p>Pretraining creates <strong>capability</strong>.</p><p>RLHF creates <strong>behavior</strong>.</p><p>Without alignment, LLMs optimize for the likelihood of text found online, not for truth or usefulness.</p><p>With alignment, they optimize for human preference.</p><p>This is why modern AI feels cooperative rather than chaotic.</p><h3>But RLHF Isn’t Perfect</h3><p>There are real technical challenges.</p><h4>1. Human Bias Transfers Into Models</h4><p>Human judgments are subjective.</p><p>Biases in ratings can propagate into the reward model.</p><p>Alignment is not neutrality. It is curated behavior.</p><h4>2. 
Reward Models Can Misinterpret Preferences</h4><p>Incorrect or ambiguous preference data can degrade performance.</p><p>Generalization outside the training distribution is also difficult.</p><p>Classic ML problem. New domain.</p><h4>3. Helpfulness vs Harmlessness Is a Tradeoff</h4><p>Balancing safety with usefulness is an optimization challenge.</p><p>Some research frames this as maximizing reward while satisfying cost constraints using methods like the Lagrangian approach.</p><p>In that framing, alignment is literally a constrained optimization problem.</p><h3>Beyond RLHF — The Next Wave</h3><p>The field is already evolving.</p><p>Researchers are exploring:</p><ul><li>AI-generated critiques</li><li>Natural language feedback</li><li>Automated reward systems</li></ul><p>One study showed iterative critique could boost response win rates to <strong>65.9%</strong>.</p><p>There is also movement toward reducing human labeling costs via reinforcement-based automation.</p><p>The future may involve <strong>AI aligning AI.</strong></p><h3>The Big Misconception: “It Was Trained on Reddit”</h3><p>Yes, internet-scale corpora are used in pretraining.</p><p>But raw data does <strong>not</strong> define final behavior.</p><p>Post-training reshapes the model.</p><p>RLHF steers responses toward:</p><ul><li>relevance</li><li>safety</li><li>helpfulness</li><li>instruction-following</li></ul><p>Think of pretraining as raising a child in the noise of the world. Alignment is education.</p><h3>The Deeper Philosophical Takeaway</h3><p>Modern AI development has quietly shifted focus from:</p><p><strong>“How do we make models smarter?”</strong></p><p>to:</p><p><strong>“How do we make models behave?”</strong></p><p>Capability scaling was the first revolution.</p><p>Alignment is the second.</p><p>And arguably the harder one.</p><p>Because intelligence is mathematical. 
Values are not.</p><h3>Final Thought</h3><p>So why is ChatGPT polite despite learning from the internet?</p><p>Because modern AI is not just trained.</p><p>It is curated.</p><p>Not just optimized for probability, <br> but optimized for preference.</p><blockquote>The internet gave language models their voice.</blockquote><blockquote><strong>RLHF taught them how to speak to humans.</strong></blockquote><p>Resources:</p><ul><li><a href="https://www.ibm.com/think/topics/rlhf">What Is Reinforcement Learning From Human Feedback (RLHF)? | IBM</a></li><li><a href="https://huggingface.co/blog/rlhf">Illustrating Reinforcement Learning from Human Feedback (RLHF)</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Your Spotify Discover Weekly Actually Slays (and how it’s not magic)]]></title>
            <link>https://medium.com/@nevardogan/why-your-spotify-discover-weekly-actually-slays-and-how-its-not-magic-2fb2655948b4?source=rss-b1544ba69283------2</link>
            <guid isPermaLink="false">https://medium.com/p/2fb2655948b4</guid>
            <category><![CDATA[spotify]]></category>
            <category><![CDATA[user-based-cf]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[item-based-cf]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Neva Erdogan]]></dc:creator>
            <pubDate>Tue, 27 Jan 2026 23:13:50 GMT</pubDate>
            <atom:updated>2026-01-27T23:13:50.659Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Technical Deep Dive</h3><p>Ever wondered why your streaming app knows you’re in your “sad girl autumn” era before you even do? It’s not a glitch in the simulation, bestie. It’s <strong>Recommendation Systems</strong> doing the most behind the scenes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/757/1*OhcIJ8nG0x1r7EFyG0Ystw.png" /></figure><p><strong>1. Content-Based Filtering: The Solo Stacker</strong></p><p>This is the “if you like this, you’ll like that” energy, but powered by heavy NLP. To understand a piece of content, the algo doesn’t just “look” at it; it vectorizes it. We use <strong>Count Vectors</strong> to map word frequencies, but the real star is <strong>TF-IDF (Term Frequency-Inverse Document Frequency)</strong>.</p><p>It calculates a weight for each token:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/322/1*cfci8qoSmKc4YJzVcW_3qQ.png" /></figure><p>This ensures that common words (like “the” or “song”) don’t drown out the unique signals (like “hyperpop” or “lo-fi”). Once we have these high-dimensional vectors, we calculate the <strong>Cosine Similarity</strong>. By finding the dot product of two normalized vectors, the system measures the “vibe distance.” If the score is close to 1, you’re getting that glass-skin content served on a silver platter. No thoughts, just optimized metadata.</p><p><strong>2. Collaborative Filtering: The “Mutuals” Method</strong></p><p>Then there’s <strong>Collaborative Filtering</strong>, which is basically the digital version of “I’ll have what she’s having.” We’re moving from item features to a massive <strong>User-Item Matrix</strong>.</p><ul><li><strong>Item-Based:</strong> Instead of looking at you, we look at the items’ track records. If a huge cluster of users gave 5 stars to both “Oversized Hoodies” and “Baggy Jeans,” the system calculates the similarity between these two item-vector columns. 
It’s a more stable approach because item ratings don’t change as fast as human moods.</li><li><strong>User-Based:</strong> This is finding your “taste twins.” The system searches for users whose rating vectors have the highest correlation with yours. If you and a random person in Berlin both stan the same 5 niche indie artists, the algo performs a weighted average of their other favorites to predict your next obsession. It’s giving soulmate behavior, but it’s actually just <strong>Pearson Correlation Coefficient</strong> at work.</li></ul><blockquote>Your feed isn’t random; it’s an ensemble of distance metrics and sparse matrix operations designed to keep your retention rate at an all-time high. Period. 💅</blockquote>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[POV: You’re Finally Learning Data Literacy Because Numbers Don’t Lie (But People Do)]]></title>
            <link>https://medium.com/@nevardogan/pov-youre-finally-learning-data-literacy-because-numbers-don-t-lie-but-people-do-7f879208e14c?source=rss-b1544ba69283------2</link>
            <guid isPermaLink="false">https://medium.com/p/7f879208e14c</guid>
            <dc:creator><![CDATA[Neva Erdogan]]></dc:creator>
            <pubDate>Fri, 23 Jan 2026 19:20:20 GMT</pubDate>
            <atom:updated>2026-01-23T19:35:17.896Z</atom:updated>
            <content:encoded><![CDATA[<h3>👀POV: You’re Finally Learning Data Literacy Because Numbers Don’t Lie (But People Do)</h3><p>We are living in an era where everyone is obsessed with &quot;data-driven&quot; decisions, but half the time, people are just throwing charts around to justify whatever they already wanted to do. If you don’t know how to read the room (or the data), you are going to get played.<br>So I went deep into the technical side of things to separate the truth from the noise. Here is the technical breakdown of data literacy, translated for those of us who want the tea without the academic boredom.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HnG7byjtqeFZRi3NwuQK5g.jpeg" /></figure><p><strong><em>The Main Characters: Population vs. Sample</em></strong><br>Think of the Population as the entire fandom. It is every single possible data point in existence for your study. But obviously, you cannot interview every single person on the planet. That is where the Sample comes in. The sample is the specific group chat you actually have access to. If your sample is messy, your insights are going to be delulu. You need a sample that actually represents the population, or else you are just projecting.</p><p><strong><em>The Vibe Check: Central Tendency</em></strong><br>When we look at a dataset, the first thing we want to know is &quot;what is the general vibe here?&quot; That is Central Tendency. But we have three different ways to measure it, and they all spill different tea.<br><strong>Mean (Arithmetic Average):</strong> This is the basic average. It is sensitive, though. One massive outlier (like a billionaire walking into a room of students) can skew the whole number and ruin the vibe.<br><strong>Median:</strong> This is the unbothered middle child. It literally sits in the center of the data when sorted. It does not care about that one billionaire outlier. 
For skewed data, it is usually more honest than the mean.<br><strong>Mode:</strong> The specific value that shows up the most. It is the trendsetter.</p><p><strong><em>The Drama: Dispersion and Spread</em></strong><br>Knowing the average is cute, but it doesn’t tell you about the chaos. We need to know how spread out the data is.<br><strong>Range &amp; Quartiles:</strong> Range is just the distance between the max and the min. Quartiles break the sorted data into four equal parts so you can see where the top 25% (or the bottom 25%) are actually sitting.<br><strong>Variance &amp; Standard Deviation:</strong> This is where it gets technical. Variance measures the average squared deviation from the mean. Basically, how far is everyone drifting from the center? Standard Deviation is just the square root of variance, bringing it back to the original units. If your standard deviation is high, your data is chaotic and unpredictable. If it is low, everyone is acting the same.</p><p><strong><em>The Aesthetic: Shape of the Data</em></strong><br>Not all data looks like a perfect bell curve. Sometimes it has issues.<br><strong>Skewness:</strong> This tells you if the data is leaning too hard to one side. If it is right-skewed, the tail drags out to the right (positive), meaning most people are low, but a few high rollers are stretching the graph.<br><strong>Kurtosis:</strong> This measures how heavy the tails are, which people often picture as &quot;peak&quot; intensity. High kurtosis (Leptokurtic) means a sharp center with heavy tails (lots of outliers). Low kurtosis (Platykurtic) means thin tails and a curve that looks flat and chill.</p><p><strong><em>The Conclusion</em></strong><br>Data literacy isn’t just about making pretty charts using visualization tools. 
It is about understanding the underlying statistical models, recognizing the variable types (nominal, ordinal, interval, ratio), and knowing when a correlation is actually a coincidence.</p><p>Next time someone shows you a &quot;statistic,&quot; check their standard deviation and ask about their skewness before you believe the hype.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Behavioral Segmentation to Value Prediction: Two Customer Analytics Projects]]></title>
            <link>https://medium.com/@nevardogan/from-behavioral-segmentation-to-value-prediction-two-customer-analytics-projects-explained-f024d1b4b957?source=rss-b1544ba69283------2</link>
            <guid isPermaLink="false">https://medium.com/p/f024d1b4b957</guid>
            <category><![CDATA[cltv]]></category>
            <category><![CDATA[customer-analytics]]></category>
            <category><![CDATA[customer-segmentation]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[rfm-analysis]]></category>
            <dc:creator><![CDATA[Neva Erdogan]]></dc:creator>
            <pubDate>Fri, 31 Oct 2025 21:28:13 GMT</pubDate>
            <atom:updated>2025-11-08T23:40:12.621Z</atom:updated>
            <content:encoded><![CDATA[<h4>Two projects, 20,000 customers, and a lot of “wait, we can actually predict this?” moments</h4><p>I recently finished two customer analytics projects that completely changed how I think about customer value. Not in a “wow this is revolutionary” way, but in a “why isn’t everyone doing this already?” way.</p><p>The setup: a dataset of 20,000 OmniChannel customers (people who shop both online and offline) from 2020–2021. The mission: figure out who’s actually valuable and who’s just… there.</p><p>Spoiler: Most customers are just there. Here’s how I proved it with data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qKctco8IQwPkoV1k2XgtJA.png" /></figure><h3>Project 1: RFM Analysis (The One Where I Sorted 20,000 People)</h3><h3>What Is RFM?</h3><p>RFM stands for:</p><ul><li><strong>Recency</strong>: How recently did they purchase?</li><li><strong>Frequency</strong>: How often do they purchase?</li><li><strong>Monetary</strong>: How much do they spend?</li></ul><p>It’s literally just scoring customers on these three metrics. No machine learning, no neural networks, just smart scoring.</p><h3>The Method</h3><p>I scored each customer 1–5 on each metric using quintiles (five equal-sized buckets):</p><p><strong>Recency</strong> (inverted: fewer days since the last purchase means a higher score):</p><ul><li>Score 5: Bought within last few weeks</li><li>Score 1: Hasn’t bought in months</li></ul><p><strong>Frequency</strong>:</p><ul><li>Score 5: Tons of purchases</li><li>Score 1: One or two purchases total</li></ul><p><strong>Monetary</strong>:</p><ul><li>Score 5: High spender</li><li>Score 1: Minimal spending</li></ul><p>Then I combined Recency + Frequency into an RF Score. Someone with score “55” is golden (recent + frequent). 
Someone with “11” is basically gone.</p><pre># The actual scoring code<br># Recency is inverted: the smallest recency values (most recent buyers) get score 5<br>rfm[&quot;recency_score&quot;] = pd.qcut(rfm[&#39;recency&#39;], 5, labels=[5, 4, 3, 2, 1])<br># rank(method=&quot;first&quot;) breaks ties so qcut can always form five bins<br>rfm[&quot;frequency_score&quot;] = pd.qcut(rfm[&#39;frequency&#39;].rank(method=&quot;first&quot;), 5, labels=[1, 2, 3, 4, 5])<br>rfm[&quot;monetary_score&quot;] = pd.qcut(rfm[&#39;monetary&#39;], 5, labels=[1, 2, 3, 4, 5])</pre><pre>rfm[&quot;RF_SCORE&quot;] = (rfm[&#39;recency_score&#39;].astype(str) + <br>                    rfm[&#39;frequency_score&#39;].astype(str))</pre><h3>The 10 Customer Types That Emerged</h3><p>Using regex mapping on RF scores, I identified 10 distinct segments:</p><p><strong>Champions (RF: 54, 55)</strong></p><ul><li>Recent buyers + high frequency</li><li>Your brand ambassadors</li><li>What I did: Targeted them for premium product launches</li></ul><p><strong>Loyal Customers (RF: 34, 35, 44, 45)</strong></p><ul><li>Consistent purchase behavior</li><li>Solid revenue base</li><li>What I did: Loyalty program candidates</li></ul><p><strong>Potential Loyalists (RF: 42, 43, 52, 53)</strong></p><ul><li>Recent customers with potential</li><li>Need nurturing</li><li>What I did: Onboarding campaigns</li></ul><p><strong>New Customers (RF: 51)</strong></p><ul><li>Just showed up, low frequency</li><li>Critical window to convert</li><li>What I did: Welcome series, incentives</li></ul><p><strong>Promising (RF: 41)</strong></p><ul><li>Recent but need frequency boost</li><li>What I did: Repeat purchase campaigns</li></ul><p><strong>Need Attention (RF: 33)</strong></p><ul><li>Average across the board</li><li>What I did: Re-engagement tactics</li></ul><p><strong>About to Sleep (RF: 31, 32)</strong></p><ul><li>Declining engagement</li><li>What I did: Wake-up campaigns</li></ul><p><strong>At Risk (RF: 13, 14, 23, 
24)</strong></p><ul><li>Used to be good, now declining</li><li>What I did: Win-back offers</li></ul><p><strong>Can’t Lose Them (RF: 15, 25)</strong></p><ul><li>Previously frequent buyers who stopped 
buying</li><li>Code red territory</li><li>What I did: Aggressive retention</li></ul><p><strong>Hibernating (RF: 11, 12, 21, 22)</strong></p><ul><li>Basically inactive</li><li>What I did: Minimal effort or let churn</li></ul><h3>Real Business Applications</h3><p><strong>Case 1: New Women’s Shoe Brand Launch</strong></p><ul><li>Target: Premium brand, above average price point</li><li>Criteria: Champions + Loyal Customers + interested in women’s category</li><li>Method: Filtered RFM segments + category interest from last 12 months</li><li>Output: Customer ID list for targeted campaign</li></ul><pre>target_customers_df = merged_df[<br>    (merged_df[&#39;segment&#39;].isin([&#39;champions&#39;, &#39;loyal_customers&#39;])) &amp;<br>    (merged_df[&#39;interested_in_categories_12&#39;].str.contains(&quot;KADIN&quot;))<br>]</pre><p><strong>Case 2: 40% Discount on Men’s &amp; Kids’ Products</strong></p><ul><li>Target: Win back at-risk customers + activate new ones</li><li>Criteria: cant_loose + at_risk + about_to_sleep + new_customers</li><li>Method: Same filtering with men’s/kids category interest</li><li>Output: Different customer ID list for discount campaign</li></ul><pre>discount_target_df = merged_df[<br>    (merged_df[&#39;segment&#39;].isin([&quot;cant_loose&quot;, &quot;at_risk&quot;, &quot;about_to_sleep&quot;, &quot;new_customers&quot;])) &amp;<br>    (merged_df[&#39;interested_in_categories_12&#39;].str.contains(&quot;ERKEK|COCUK&quot;))<br>]</pre><p>The point: Instead of blasting everyone with everything, I created precision-targeted lists. No wasted budget.</p><h3>Project 2: CLTV Prediction (The One Where I Predicted The Future)</h3><h3>Why CLTV Matters</h3><p>RFM tells you who customers are right now. 
CLTV (Customer Lifetime Value) predicts who they’ll be in 6 months.</p><p>Knowing future value lets you:</p><ul><li>Allocate marketing budget intelligently</li><li>Identify VIP program candidates</li><li>Spot high-value customers before they churn</li><li>Stop over-investing in low-value customers</li></ul><h3>The Data Structure</h3><p>Before modeling, I created weekly metrics from the raw data:</p><pre>cltv_df[&quot;recency_cltv_weekly&quot;] = round(((df[&quot;last_order_date&quot;] - df[&quot;first_order_date&quot;]).dt.days) / 7)<br>cltv_df[&quot;T_weekly&quot;] = round(((analysis_date - df[&quot;first_order_date&quot;]).dt.days)/7)<br>cltv_df[&quot;frequency&quot;] = df[&quot;order_num_total&quot;]<br>cltv_df[&quot;monetary_cltv_avg&quot;] = df[&quot;customer_value_total&quot;] / df[&quot;order_num_total&quot;]</pre><ul><li><strong>recency_cltv_weekly</strong>: How many weeks between first and last purchase (customer lifecycle)</li><li><strong>T_weekly</strong>: How many weeks old is the customer (tenure)</li><li><strong>frequency</strong>: Total number of purchases</li><li><strong>monetary_cltv_avg</strong>: Average spending per transaction</li></ul><h3>Model 1: BG-NBD (Beta Geometric/Negative Binomial Distribution)</h3><p><strong>What it predicts:</strong> Number of future transactions</p><p><strong>The assumptions:</strong></p><ul><li>Each customer has a personal purchase rate (some buy weekly, some monthly)</li><li>After each purchase, there’s a probability the customer churns</li><li>These rates vary across customers (heterogeneity)</li></ul><p><strong>Why BG-NBD specifically:</strong></p><ul><li>Handles “buy til you die” behavior</li><li>Models both purchase frequency AND dropout probability</li><li>Computationally efficient for large datasets</li></ul><pre>bgf = BetaGeoFitter(penalizer_coef=0.001)<br>bgf.fit(cltv_df[&#39;frequency&#39;],<br>        cltv_df[&#39;recency_cltv_weekly&#39;],<br>        cltv_df[&#39;T_weekly&#39;])</pre><pre># 
Predict next 3 and 6 
months<br>cltv_df[&quot;exp_sales_3_month&quot;] = bgf.predict(4*3, ...)  # 4 weeks * 3 months<br>cltv_df[&quot;exp_sales_6_month&quot;] = bgf.predict(4*6, ...)  # 4 weeks * 6 months</pre><p>The penalizer_coef=0.001 prevents overfitting. Lower value = less regularization since the dataset is large.</p><h3>Model 2: Gamma-Gamma</h3><p><strong>What it predicts:</strong> Average monetary value per transaction</p><p><strong>Key assumption:</strong></p><ul><li>Monetary value varies randomly around each customer’s average</li><li>This variation is independent of purchase frequency</li><li>(A frequent buyer might spend little per transaction; an infrequent buyer might spend a lot)</li></ul><p><strong>Why Gamma-Gamma:</strong></p><ul><li>Designed for positive continuous values (transaction amounts)</li><li>Captures customer heterogeneity in spending</li><li>Works only with customers who have frequency &gt; 1</li></ul><pre>ggf = GammaGammaFitter(penalizer_coef=0.01)<br>ggf.fit(cltv_df[&#39;frequency&#39;], cltv_df[&#39;monetary_cltv_avg&#39;])</pre><pre>cltv_df[&quot;exp_average_value&quot;] = ggf.conditional_expected_average_profit(<br>    cltv_df[&#39;frequency&#39;],<br>    cltv_df[&#39;monetary_cltv_avg&#39;]<br>)</pre><h3>The CLTV Calculation</h3><p>Combine both models:</p><pre>cltv = ggf.customer_lifetime_value(<br>    bgf,<br>    cltv_df[&#39;frequency&#39;],<br>    cltv_df[&#39;recency_cltv_weekly&#39;],<br>    cltv_df[&#39;T_weekly&#39;],<br>    cltv_df[&#39;monetary_cltv_avg&#39;],<br>    time=6,        # 6 month prediction<br>    freq=&quot;W&quot;,      # Weekly frequency<br>    discount_rate=0.01  # 1% monthly discount<br>)</pre><p><strong>Formula essentially:</strong> CLTV = (Expected number of transactions in 6 months) × (Expected average transaction value) × (Discount factor)</p><p>The discount_rate=0.01 accounts for time value of money (future revenue is worth slightly less than present revenue).</p><h3>The 4-Tier Segmentation</h3><p>I segmented customers 
into quartiles:</p><pre>cltv_df[&quot;cltv_segment&quot;] = pd.qcut(cltv_df[&quot;cltv&quot;], 4, labels=[&quot;D&quot;, &quot;C&quot;, &quot;B&quot;, &quot;A&quot;])</pre><ul><li><strong>Segment A</strong>: Top 25% by predicted 6-month value</li><li><strong>Segment B</strong>: Next 25%</li><li><strong>Segment C</strong>: Next 25%</li><li><strong>Segment D</strong>: Bottom 25%</li></ul><h3>Real Application: VIP Program Selection</h3><p>Instead of guessing who deserves VIP treatment, I used data:</p><p><strong>Criteria:</strong></p><ol><li>Must be in Segment A (top predicted value)</li><li>Frequency above median (actually shops regularly)</li><li>Monetary value &gt; 75th percentile (from my data: 182.45)</li></ol><pre>vip_customers = cltv_df[<br>    (cltv_df[&#39;cltv_segment&#39;] == &#39;A&#39;) &amp;<br>    (cltv_df[&#39;frequency&#39;] &gt; cltv_df[&#39;frequency&#39;].median()) &amp;<br>    (cltv_df[&#39;monetary_cltv_avg&#39;] &gt; 182.4500)<br>]</pre><p>This created a defensible VIP list. When someone asks “why is this customer VIP?” I have three quantifiable reasons.</p><h3>Budget Allocation Strategy</h3><p>Instead of spreading budget equally:</p><ul><li><strong>Segment A</strong>: 50% of retention budget</li><li><strong>Segment B</strong>: 30%</li><li><strong>Segment C</strong>: 15%</li><li><strong>Segment D</strong>: 5%</li></ul><p>Proportional to predicted value. 
Segment A customers get 10x the attention of Segment D customers because the models say they’re worth it.</p><h3>Technical Implementation</h3><p><strong>Core stack:</strong></p><pre>import pandas as pd<br>import datetime as dt<br>from lifetimes import BetaGeoFitter, GammaGammaFitter</pre><p><strong>Data prep challenges:</strong></p><ul><li>Outlier handling (used IQR method at 1st/99th percentiles)</li><li>Date formatting (converted all date columns to datetime)</li><li>Creating omnichannel metrics (online + offline totals)</li><li>Handling customers with frequency = 1 (excluded from Gamma-Gamma)</li></ul><p><strong>Model calibration:</strong></p><ul><li>BG-NBD penalizer: 0.001 (minimal regularization, large dataset)</li><li>Gamma-Gamma penalizer: 0.01 (slight regularization)</li><li>Both chosen through experimentation</li></ul><h3>What I Actually Learned</h3><p><strong>RFM is underrated.</strong> Everyone talks about fancy ML models. But RFM with 10 segments gives you immediately actionable customer groups in like 30 minutes. Sometimes simple wins.</p><p><strong>Probabilistic models &gt; guessing.</strong> Before: “This customer seems valuable?” After: “This customer has 78% probability of making 3 purchases in 6 months with expected value of $247.”</p><p><strong>The lifetimes library is a gift.</strong> Implementing BG-NBD and Gamma-Gamma from scratch would take weeks. The library makes it 10 lines of code.</p><p><strong>Data prep is 60% of the work.</strong> Outliers, date formatting, creating the right metrics: that’s where I spent most of my time. The modeling itself was fast.</p><p><strong>Business context matters.</strong> I tried 4-segment vs 7-segment splits. Four was clearer for stakeholders. 
More segments = more precision but harder to action.</p><h3>The Output</h3><p><strong>From RFM:</strong></p><ul><li>10 customer segments with clear behavioral patterns</li><li>Two targeted customer lists for specific campaigns</li><li>Framework for future segmentation</li></ul><p><strong>From CLTV:</strong></p><ul><li>6-month value predictions for all customers</li><li>4-tier segmentation by predicted value</li><li>VIP program candidate list with quantifiable criteria</li><li>Budget allocation strategy backed by predictions</li></ul><p><strong>Both together:</strong></p><ul><li>Current behavior (RFM) + future value (CLTV)</li><li>Who’s valuable now + who’ll be valuable later</li><li>Precision targeting without wasting budget</li></ul><h3>Why This Matters</h3><p>Most companies treat customers uniformly. Same email campaigns, same offers, same attention. That’s inefficient.</p><p>The customers who just browsed once six months ago? They’re not coming back. Stop emailing them.</p><p>The customers who buy monthly and spend above average? They’re your revenue base. Invest accordingly.</p><p>The customers scoring high on predicted CLTV? They’re your future. Nurture them now.</p><p>It’s not complicated. It’s just math + business logic.</p><h3>Replication</h3><p>Both projects are on GitHub with full code and READMEs. Check below.</p><ul><li><a href="https://github.com/nevaerdogan/Customer_Segmentation_with_RFM">GitHub - nevaerdogan/Customer_Segmentation_with_RFM: End-to-end RFM customer segmentation pipeline: data preprocessing, metric calculation, scoring, and targeted customer identification for marketing campaigns with extra explanatory comment lines for better understanding.</a></li><li><a href="https://github.com/nevaerdogan/Customer_Lifetime_Value_Prediction">GitHub - nevaerdogan/Customer_Lifetime_Value_Prediction: Customer Lifetime Value (CLTV) prediction using BG-NBD and Gamma-Gamma models. 
Probabilistic modeling to forecast customer value and optimize marketing strategies.</a></li></ul><p><strong>What you need:</strong></p><ul><li>Python (pandas, lifetimes, datetime)</li><li>Customer transaction data with dates and amounts</li><li>Willingness to actually clean your data (it’s always messy)</li></ul><p><strong>What you’ll get:</strong></p><ul><li>Customer segments you can immediately action</li><li>Value predictions that inform budget allocation</li><li>Data-backed answers to “which customers matter?”</li></ul><blockquote>Let’s be real: some customers absolutely slay💅(high value, loyal, frequent), and some are just lame 😒(one-and-done, low spending, gone forever). The models separate the two so you know where to invest.</blockquote>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When Algorithms Catch Feelings: Data Science in the Wild World of Finance]]></title>
            <link>https://medium.com/@nevardogan/when-algorithms-catch-feelings-data-science-in-the-wild-world-of-finance-47058bfe56b9?source=rss-b1544ba69283------2</link>
            <guid isPermaLink="false">https://medium.com/p/47058bfe56b9</guid>
            <category><![CDATA[stock-market]]></category>
            <category><![CDATA[gen-z]]></category>
            <category><![CDATA[finance]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Neva Erdogan]]></dc:creator>
            <pubDate>Sat, 18 Oct 2025 08:42:21 GMT</pubDate>
            <atom:updated>2026-01-21T20:33:01.134Z</atom:updated>
            <content:encoded><![CDATA[<h3>💖 When Algorithms Catch Feelings: How Data Science Tries to Decode the Market’s Emotional Chaos</h3><p>There’s something oddly poetic about the stock market.<br> It’s human emotion, fear, hope, greed, all turned into numbers that dance across a screen.</p><p>And now, those numbers are mostly being watched not by people, but by algorithms.</p><p>We built machines that try to understand what moves us.<br> But the real question is: <em>can they?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rrfwKnL0J6ohbpmgx6XszA.png" /></figure><h3>The Emotional Side of the Market</h3><p>If you’ve ever been on finance Twitter during a crash, you KNOW. The numbers are just one part of the story. Behind every candlestick chart there are actual people panicking, celebrating, or straight up having a meltdown.</p><p>Markets aren’t logical; they’re basically a social experiment where everyone’s emotions have a price tag.</p><p>That’s what makes them so fun to analyze though. Every spike is a digital footprint of collective freakout. Remember the GME thing? That was mass psychology in 4K. The silence after a crash? That’s everyone holding their breath, and yeah, the data picks up on ALL of it.</p><p>What gets me is that algorithms are actually getting decent at reading these vibes. Sentiment analysis, NLP models, even multimodal stuff now: BERT-style models can detect bullish vs. bearish tone on Reddit, and they can catch shifts in investor mood from tweets or Reddit posts before most traders even notice. It’s like we gave machines intuition, but in Python.</p><p>But here’s where it gets messy: emotions don’t follow rules. You can’t model FOMO the way you model inflation. That’s the challenge: trying to quantify something that was never meant to fit in a neat little dataframe.</p><h3>The Rise of Data-Driven Decisions</h3><p>Trading floors used to be loud, chaotic, people yelling. 
Now it’s just servers humming in some data center.</p><p>But here’s the weird part:</p><p>The same algorithms we built to REMOVE human emotion? We’re now training them to DETECT it.</p><p>Like… we’re teaching machines to sense fear and optimism and confidence. Things we barely understand about ourselves.</p><p>It’s cool but also kind of dystopian? Because at the end of the day, these systems are just mirrors. They reflect our habits, our biases, our desperate belief that we can predict the unpredictable.</p><h3>When Models Fail (and Why That’s Beautiful)</h3><p>Let’s be real: no matter how much data you feed them, models fail. Spectacularly sometimes. One unexpected tweet, one global event, and boom, your perfectly tuned regression model suddenly looks like it learned nothing.</p><p>But maybe that’s not a flaw. Maybe it’s the point.</p><p>Finance refuses to be predicted. Every time a model breaks, it’s reminding us that the world still has variables we can’t capture: gut feelings, random rumors, mass panic, hope. And that’s kinda reassuring in a weird way.</p><p>In a world of algorithmic trading and optimized everything, failure proves that chaos still exists. That humanity still matters.</p><p>What makes models interesting isn’t when they work perfectly, it’s when they break and you have to figure out WHY. That’s where you learn about bias, about edge cases, about how our systems are just us trying to make sense of something that doesn’t always want to make sense.</p><p>So when a model crashes? Yeah, it sucks, but it’s also a reminder that even with all this AI, uncertainty still wins. 
And honestly that’s kind of beautiful.</p><h3>Where It’s All Going</h3><p>Finance and data science are blending into one another, not competing, just evolving together.<br>Markets are becoming emotional ecosystems where data tells stories and algorithms try to understand them.</p><p>We’re moving toward a world where every decision, from billion-dollar trades to your phone’s investment suggestion, will be shaped by models that learn, adapt, and maybe even anticipate our emotions. And that’s both exciting and humbling.</p><p>Because for all the sophistication of our algorithms, what truly drives the market is still us, the humans feeding data into the machine, reacting to trends, panicking, celebrating, hoping. The algorithms might be learning fast, but they’re still learning <em>from</em> us.</p><p>So, maybe the future of finance isn’t about removing emotion from data; it’s about understanding how emotion <em>creates</em> data. Maybe data science won’t just predict the market, but help us see ourselves a little more clearly in it.</p><p>And if algorithms are starting to catch feelings… maybe that’s not such a bad thing after all.</p><blockquote>✨ Author’s Note<br> Lowkey, the market is chaotic and we’re all just trying to vibe with it.<br> I hope this made you see the market a little differently and maybe feel a bit of empathy for the bots just trying to keep up with us.</blockquote>]]></content:encoded>
        </item>
    </channel>
</rss>