+ 2

Split to Achieve Gain

This is a second attempt at posing this question; I think my first attempt was unclear.

The problem can be found at: https://www.sololearn.com/learning/eom-project/1094/103

The code that I submitted can be found at: https://code.sololearn.com/cQEDIvXRgL3e

A pedantic version of my code submission, with editable inputs and a detailed explanation of all outputs, can be found at: https://code.sololearn.com/cO755SFZAUJ0

This seems like it should be a fairly simple coding task. All the functions and formulae produce the output I expect, and the two visible test cases on the problem both pass. All of the hidden test cases fail. Since everything looks fine to me, I must be missing something.

Here is the code:

    s = [int(x) for x in input().split()]
    a = [int(x) for x in input().split()]
    b = [int(x) for x in input().split()]

    # Function to get counts for set and splits, to be used in later formulae.
    def setCount(n):
        return len(n)

    Cs = setCount(s)
    Ca = setCount(a)
    Cb = setCount(b)

    # Function to get sums of "True" values in each, for later formulae.
    def tSum(x):
        sum = 0
        for n in x:
            if n == 1:
                sum += 1
        return sum

    Ss = tSum(s)
    Sa = tSum(a)
    Sb = tSum(b)

    # Function to get percentage of "True" values in each, for later formulae.
    def getp(x, n):
        p = x/n
        return p

    Ps = (getp(Ss, Cs))
    Pa = (getp(Sa, Ca))
    Pb = (getp(Sb, Cb))

    # Function to get Gini impurity for each, to be used in final formula.
    def gimp(p):
        return 2 * p * (1-p)

    Hs = (gimp(Ps))
    Ha = (gimp(Pa))
    Hb = (gimp(Pb))

    # Final formula, intended to output information gain to five decimal places.
    infoGain = round((Hs - (Sa/Ss) * Ha - (Sb/Ss) * Hb),5)
    print(infoGain)

python machinelearning gini

8th Jul 2022, 3:08 PM

Jimmy Tyrrell

11 Answers

+ 3

OK, so here's my train of thought. You are overthinking it a little and storing too many variables. Let's see how we can represent the formula in Python.

Gini is a function of a probability, a single number between 0 and 1.

    def gini(p):
        return 2 * p * (1-p)

How do we get this p from the data? We need to calculate it for each dataset. The data is a list of 1 and 0 values, so we can simply use the built-in sum and len functions to calculate p!

    def p(data):
        return sum(data) / len(data)

Then the info gain weights each split's impurity by the size of its group relative to the whole set. We could even write it in a single line, but I find the formula readable like this.

    giniS = gini(p(S))
    deltaA = gini(p(A)) * len(A) / len(S)
    deltaB = gini(p(B)) * len(B) / len(S)
    gain = giniS - deltaA - deltaB

This passes all tests for me. https://code.sololearn.com/cUoaMq6bzxP8/?ref=app

8th Jul 2022, 8:04 PM

Tibor Santa
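
The snippets in the answer above can be combined into one runnable sketch. The names fraction_of_ones and info_gain, and the hard-coded lists (the task's sample values, assuming they split into lines of 6, 3, and 3 numbers), are assumptions for illustration; an actual submission would read the three lines with input().

```python
def gini(p):
    """Gini impurity for a binary target, given the fraction p of 1s."""
    return 2 * p * (1 - p)


def fraction_of_ones(data):
    """Fraction of 1s in a non-empty list of 0/1 values."""
    return sum(data) / len(data)


def info_gain(S, A, B):
    """Information gain of splitting S into A and B (hypothetical helper).

    Each split's impurity is weighted by its share of the rows in S.
    """
    weighted_a = gini(fraction_of_ones(A)) * len(A) / len(S)
    weighted_b = gini(fraction_of_ones(B)) * len(B) / len(S)
    return gini(fraction_of_ones(S)) - weighted_a - weighted_b


S = [1, 0, 1, 0, 1, 0]
A = [1, 1, 1]
B = [0, 0, 0]
print(round(info_gain(S, A, B), 5))  # 0.5
```

Passing lists into a function, rather than reading input() at module level, makes it easy to try other splits by hand.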

+ 5

Thank you, that's nice.

9th Jul 2022, 3:42 AM

Tibor Santa

+ 3

Thanks. I think I haven't done that chapter yet ☺️ but I will take a look tonight.

8th Jul 2022, 5:36 PM

Tibor Santa

+ 2

Could you tell me which course and lesson this task is in? The link does not work for me.

8th Jul 2022, 5:06 PM

Tibor Santa

+ 2

@Tibor Santa: Thanks for your response! The lesson is called "Split to Achieve Gain" and it's the end-of-module project in the "Decision Trees" module of the Machine Learning course. The instructions are below:

Machine Learning - Split to Achieve Gain

Calculate Information Gain.

Task
Given a dataset and a split of the dataset, calculate the information gain using the Gini Impurity. The first line of the input is a list of the target values in the initial dataset. The second line is the target values of the left split and the third line is the target values of the right split. Round your result to 5 decimal places. You can use round(x, 5).

Input Format
Three lines of 1's and 0's separated by spaces

Output Format
Float (rounded to 5 decimal places)

Sample Input
1 0 1 0 1 0
1 1 1
0 0 0

Sample Output
0.5

8th Jul 2022, 5:34 PM

Jimmy Tyrrell
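
For reference, the sample can be worked by hand. The formula below is the one used in Tibor Santa's answer in this thread (Gini impurity 2p(1-p), each split weighted by its size); the symbols H, p, and |X| are notation introduced here, and the sample input is assumed to split into lines of 6, 3, and 3 values.

```latex
H(X) = 2\,p_X\,(1 - p_X), \qquad p_X = \frac{\text{number of 1s in } X}{|X|}

\text{Gain} = H(S) - \frac{|A|}{|S|}\,H(A) - \frac{|B|}{|S|}\,H(B)

p_S = \tfrac{3}{6},\ H(S) = 2 \cdot 0.5 \cdot 0.5 = 0.5; \qquad
p_A = 1,\ H(A) = 0; \qquad
p_B = 0,\ H(B) = 0

\text{Gain} = 0.5 - \tfrac{3}{6} \cdot 0 - \tfrac{3}{6} \cdot 0 = 0.5
```

Because both splits are pure here, the weights multiply zero, so this sample gives 0.5 no matter how the splits are weighted.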

+ 2

To be honest, I don't really see any logical flaw in your code. It looks OK, and after all it clears two tests. Floating-point calculations are rather delicate: you may not get the exact same result if you merely change the order of the operands. So I think it is not really fair of the exercise to ask for an exact match, especially because hidden tests are involved. Anyway... we can't really change that.

8th Jul 2022, 8:23 PM

Tibor Santa
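
One difference between the question's code and the working answer is visible in the final formula: the question weights each split by Sa/Ss and Sb/Ss (counts of 1s), while the answer weights by len(A)/len(S) and len(B)/len(S) (group sizes). The sketch below compares the two weightings on a made-up input. The function names and the data are hypothetical, and whether this accounts for the hidden test failures is an assumption, since those tests are not visible.

```python
def gini(p):
    """Gini impurity for a binary target, given the fraction p of 1s."""
    return 2 * p * (1 - p)


def impurity(data):
    """Gini impurity of a non-empty list of 0/1 values."""
    return gini(sum(data) / len(data))


def gain_weighted_by_size(S, A, B):
    """Weights each split by its share of the rows (len), as in the answer."""
    return (impurity(S)
            - len(A) / len(S) * impurity(A)
            - len(B) / len(S) * impurity(B))


def gain_weighted_by_ones(S, A, B):
    """Weights each split by its share of the 1s (sum), as in the question."""
    return (impurity(S)
            - sum(A) / sum(S) * impurity(A)
            - sum(B) / sum(S) * impurity(B))


# Task sample: both splits are pure, so the weights do not matter.
sample = ([1, 0, 1, 0, 1, 0], [1, 1, 1], [0, 0, 0])
print(gain_weighted_by_size(*sample), gain_weighted_by_ones(*sample))

# Made-up input with an impure left split: the two weightings disagree.
made_up = ([1, 1, 0, 0, 0, 0], [1, 1, 0], [0, 0, 0])
print(round(gain_weighted_by_size(*made_up), 5))  # 2/9, i.e. 0.22222
print(gain_weighted_by_ones(*made_up))            # approximately 0
```

Note that the sum-based weighting also divides by sum(S), which would raise ZeroDivisionError for a dataset containing no 1s.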

+ 2

Tibor Santa, your solution worked for me, thanks! I'm still going to mess around with what I wrote and see if I can make it work, though. I'd like to see how much variance I can produce with slight changes in how I write the code. It's on the to-do list, but for now I'll keep moving forward.

8th Jul 2022, 8:42 PM

Jimmy Tyrrell

+ 2

Tibor Santa I hope you don't mind, but I put your code into an answer at Stack Overflow, where I'd asked the same question. I attributed the solution to you, of course :) https://stackoverflow.com/questions/72914385/how-can-i-improve-this-python-code-to-calculate-information-gain-from-gini-impur/

8th Jul 2022, 9:17 PM

Jimmy Tyrrell

+ 2

ANS:

    S = [int(x) for x in input().split()]
    A = [int(x) for x in input().split()]
    B = [int(x) for x in input().split()]

    p1=sum(S)/len(S)
    imp1=2*p1*(1-p1)
    p2=sum(A)/len(A)
    imp2=2*p2*(1-p2)
    p3=sum(B)/len(B)
    imp3=2*p3*(1-p3)

    i=imp1-len(A)/len(S)*imp2-len(B)/len(S)*imp3
    print(round(i,5))

25th Jul 2022, 11:30 AM

Reza Zeraat Kar

+ 1

Update: I added round(x , 5) to the return statements for the getp and gimp functions, thinking it might massage the results in the hidden test cases and cause at least one of them to pass. It had no visible effect on any of them.

8th Jul 2022, 5:58 PM

Jimmy Tyrrell

+ 1

Tibor Santa thanks! I'll check it once I get back to my laptop. Does making lots of small functions and storing them in variables, vs writing a smaller volume of more comprehensive functions, potentially change the values for long floats? We couldn't see the hidden test cases, but I'm assuming they were all five decimals long.

8th Jul 2022, 8:14 PM

Jimmy Tyrrell
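
On the question of whether storing intermediate results in variables changes float values: in Python it does not, because a float has the same value whether it is bound to a name or used inline. Reordering or regrouping operations can change the last bits, though rounding to 5 decimals typically hides that. A small sketch with arbitrary numbers unrelated to the task:

```python
# Storing an intermediate result in a variable does not change its value.
x = 0.1 + 0.2
stored = x + 0.3
inline = (0.1 + 0.2) + 0.3
print(stored == inline)  # True

# Regrouping the same operands can change the last bits of the result...
regrouped = 0.1 + (0.2 + 0.3)
print(inline == regrouped)  # False

# ...but the difference is far below the 5th decimal place.
print(round(inline, 5) == round(regrouped, 5))  # True
```

So differences of this kind are unlikely to explain a failed test that compares values rounded to 5 decimals, unless a result happens to sit right on a rounding boundary.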